Most programs today are written not by professional software developers, but by people with expertise in
other domains working towards goals for which they need computational support. For example, a teacher 
might write a grading spreadsheet to save time grading, or an interaction designer might use an interface 
builder to test some user interface design ideas. Although these end-user programmers may not have the 
same goals as professional developers, they do face many of the same software engineering challenges, 
including understanding their requirements, as well as making decisions about design, reuse, integration, 
testing, and debugging. This article summarizes and classifies research on these activities, defining the area
of End-User Software Engineering (EUSE) and related terminology. The article then discusses empirical 
research about end-user software engineering activities and the technologies designed to support them. The 
article also addresses several crosscutting issues in the design of EUSE tools, including the roles of risk,
reward, domain complexity, and self-efficacy, as well as the potential of educating users about software
engineering principles.
1. INTRODUCTION

From the first digital computer programs in the 1940s to today's rapidly growing 
software industry, computer programming has become a technical skill of millions. As 
this profession has grown, however, a second, perhaps more powerful trend has begun 
to take shape. According to statistics from the U.S. Bureau of Labor Statistics, by
2012 in the United States there will be fewer than 3 million professional programmers, 
but more than 55 million people using spreadsheets and databases at work, many 
writing formulas and queries to support their job [Scaffidi et al. 2005]. There are also
millions designing websites with JavaScript, writing simulations in MATLAB [Gulley
2006], prototyping user interfaces in Flash [Myers et al. 2008], and using countless 
other platforms to support their work and hobbies. Computer programming, almost as 
much as computer use, is becoming a widespread, pervasive practice. 

What makes these "end-user programmers" different from their professional coun- 
terparts is their goals: professionals are paid to ship and maintain software over time; 
end users, in contrast, write programs to support some goal in their own domains of 
expertise. End-user programmers might be secretaries, accountants, children [Petre 
and Blackwell 2007], teachers [Wiedenbeck 2005], interaction designers [Myers et al. 
2008], scientists [Segal 2007] or anyone else who finds themselves writing programs to 
support their work or hobbies. Programming experience is an independent concern. For 
example, despite their considerable programming skills, many system administrators 
view programming as only a means to keeping a network and other services online 
[Barrett et al. 2004]. The same is true of many research scientists [Carver et al. 2007; 
Segal 2007]. 

Despite their differences in priorities from professional developers, end-user pro- 
grammers face many of the same software engineering challenges. For example, they 
must choose which APIs, libraries, and functions to use [Ko et al. 2004]. Because their 
programs contain errors [Panko 1998], they test, verify and debug their programs. They 
also face critical consequences of failure. For example, a Texas oil firm lost millions of
dollars in an acquisition deal through an error in a spreadsheet formula [Panko 1995]. 
The consequences are not just financial. Web applications created by small-business 
owners to promote their businesses do just the opposite if they contain bad links or 
pages that display incorrectly, resulting in loss of revenue and credibility [Rosson 
et al. 2005]. Software resources configured by end users to monitor non-critical medical 
conditions can cause unnecessary pain or discomfort for users who rely on them [Orrick 
2006]. 

Because of these quality issues, researchers have begun to study end-user program- 
ming practices and invent new kinds of technologies that collaborate with end users 
to improve software quality. This research area is called end-user software engineer- 
ing (EUSE). This topic is distinct from related topics in end-user development in 
its focus on software quality. For example, there have been prior surveys of novice 
programming environments [Kelleher and Pausch 2005], discussing systems that ei- 
ther help students acquire computing skills or enable the creation of computational 
artifacts; while quality is a concern in these contexts, this work focuses largely on 
learning goals. There have also been surveys on end-user programming [Sutcliffe and
Mehandjiev 2004; Lieberman et al. 2006; Wulf et al. 2006], but these focus on the
construction of programs to support other goals rather than on engineering activities
peripheral to construction, such as requirements, specifications, reuse, testing, and
debugging.

In this article, these software engineering activities are our primary focus. We 
start by proposing definitions of programming, end-user programming, and end-user 
software engineering, focusing on differences in intents and priorities between 



ACM Computing Surveys, Vol. 43, No. 3, Article 21, Publication date: April 2011. 



The State of the Art in End-User Software Engineering 




end-user programming and professional software development. We follow with a 
lifecycle-oriented treatment of end-user software engineering research, organizing 
more than a decade of research on incorporating requirements, design, testing, 
verification, and debugging into end users' existing work practices. We then discuss 
a variety of crosscutting issues in end-user software engineering research, including 
the role of risk, reward, and domain of practice on end users' decision-making, as well 
as strategies for persuading users to engage in more rigorous software engineering 
activities as part of their normal work. We also discuss individual factors, such as 
self-efficacy and gender, and their influence on how effectively people use EUSE 
tools. 

What we found in our review of these research efforts were two different histo- 
ries. First, studies of end-user software engineering concerns have had a consistently 
broad scope. Researchers have studied children [Petre and Blackwell 2007], middle- 
school students [Baker 2007; Kelleher et al. 2007], system administrators [Barrett et al. 
2004], people at home [Blackwell 2004], knowledge workers in large companies [Bogart 
et al. 2008; Scaffidi et al. 2006], interaction designers [Ko et al. 2004; Brandt et al. 2008; 
Myers et al. 2008], natural scientists [Carver et al. 2007; Segal 2007], software archi- 
tects [Lakshminarayanan et al. 2006], bioinformatics professionals [Letondal 2006], 
web designers [Rode and Rosson 2003], and even volunteers helping with disaster 
relief [Scaffidi et al. 2007]. 

In contrast to studies of EUSE, contributions to EUSE tools have historically had 
a narrow scope. Early work focused largely on spreadsheets and event-based com- 
puting paradigms and on perfective aspects of end-user software engineering, such 
as testing, verification and debugging. Part of this historical bias is due to the fact 
that doing research on a particular paradigm has required mature and flexible pro- 
gramming languages, platforms, and IDEs on which to build more helpful software 
engineering tools, and most end-user programming platforms have not exhibited these 
properties. More recently, however, this bias has been eliminated, with recent work
focusing on a much broader set of domains and paradigms, including the web, mobile 
devices, personal information management, business processes, and programming in 
the home. Researchers have also extended their focus from perfective activities to de- 
sign, including work on requirements, specifications, and reuse. Part of this shift is due 
to the advent of interactive web applications, as sharing code is becoming much more 
common. 

These trends, coupled with the fact that computing is rapidly being incorporated into 
an incredible array of human activities, suggest that EUSE research will become sim- 
ilarly diverse. This will pose many challenges for the field, since the various domains 
studied and supported by research may have little in common. This is also an oppor- 
tunity, however, for researchers to identify what is common across these diverse areas 
of practice. This article represents an effort at identifying some of these fundamental 
challenges, grounded in lessons from prior work. 

2. DEFINITIONS 

One contribution of this article is to identify existing terms in EUSE research and 
fill in terminology gaps, creating a well-defined vocabulary upon which to build future 
research. In this section, we start with a basic definition of programming and end with 
a definition of end-user software engineering. 

2.1. Programming and Programs 

We define programming similarly to modern English dictionaries, as the process of 
planning or writing a program. This leads to the need for a definition of the term 
program. Some definitions of "program" are in terms of the language in which the
program is written, requiring, for example, that the notation be Turing complete and
able to specify sequence, conditional logic, and iteration. However, definitions such as
these are heavily influenced by the type of activity being automated. To remove these 
somewhat arbitrary constraints from the definition, for the purposes of this article we 
define a program as "a collection of specifications that may take variable inputs, and 
that can be executed (or interpreted) by a device with computational capabilities." Note 
that the variability of input values requires that the program has the ability to execute 
on future values, which is one way it is different from simply doing a computation 
once manually. This definition captures general purpose languages in wide use, such 
as Java and C, but also notations as simple as VCR programs, written to record a 
particular show when the time of day (input) satisfies the specified constraint, and 
combinations of HTML and CSS, which are interpreted to produce a specific visual 
rendering of shapes and text. It also captures the use of report generators, which take 
some abstract specification of the desired report and automatically create the finished 
report. 
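
As a minimal sketch of this definition, the VCR example can be written as a specification that is evaluated against a variable input. The class name RecordingRule, its fields, and the channel and times below are hypothetical, chosen only for illustration; they do not describe any real device.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class RecordingRule:
    """Hypothetical VCR-style specification: record a channel in a time window."""
    channel: int
    start: time
    end: time

    def should_record(self, now: time) -> bool:
        # The time of day is the variable input; the specification is fixed
        # once and then applied to every future value of that input.
        return self.start <= now < self.end


# Assumed example values: record channel 5 between 20:00 and 21:00.
rule = RecordingRule(channel=5, start=time(20, 0), end=time(21, 0))

for now in (time(19, 59), time(20, 30), time(21, 0)):
    print(now, rule.should_record(now))
```

The rule is written once and applied to any later time of day, which is what separates it from performing the same check once by hand.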

2.2. End-User Programming 

We now turn to end-user programming, a phrase popularized by Nardi [1993] in her 
investigations into spreadsheet use in office workplaces. An end user 1 is simply any 
computer user. We then define end-user programming as "programming to achieve the 
result of a program primarily for personal, rather than public use." The important distinction
here is that the program itself is not primarily intended for use by a large number of users
with varying needs. For example, a teacher may write a grades spreadsheet to track 
students' test scores, a photographer might write a Photoshop script to apply the 
same filters to a hundred photos, or a caretaker might write a script to help a person
with cognitive disabilities be more independent [Carmien and Fischer 2008]. In these 
end-user programming situations, the program is a means to an end and only one of 
potentially many tools that could be used to accomplish a goal. This definition also 
includes a skilled software developer writing "helper" code to support some primary 
task. For example, a developer is engaging in end-user programming when writing code 
to visualize a data structure to help diagnose a bug. Here, the tool and its output are 
intended to support the developer's particular task, but not a broader group of users or
use cases. 

In contrast to end-user programming, professional programming has the goal of 
producing code for others to use. The intent might be to make money, to have fun,
or to provide a public service (as is the case for many free and open source
projects). Therefore, the moment novice web designers move from designing a web page 
for themselves to designing a web page for someone else, the nature of their activity 
has changed. The same is true if the developer mentioned above decides to share the 
data structure visualization tool with the rest of his team. The moment this shift 
in intent occurs, the developer must plan and design for a broader range of possible 
uses, increasing the importance of design and testing, and the prevalence of potential 
bugs. 

It is also important to clarify two aspects of this "intent"-based definition. First, 
our definition is not intended to be dichotomous, but continuous. After all, there is no 
clear distinction between a program intended for use by five people and a program 



1 The "end" in "end user" comes from economics and business, where the person who purchases a software 
product may be different from the "end user" who uses it. Our use of the phrase in this article is more for 
historical consistency than because we need to make this distinction. 
Table I. A Partial List of Classes of People Who Write Programs and the Kinds of Programs They Write

Class of people         Activities of programming and tools and languages used
---------------         ------------------------------------------------------
System administrators   Write scripts to glue systems together, using text editors and scripting languages
Interaction designers   Prototype user interfaces with tools like Visual Basic and Flash
Artists                 Create interactive art with languages like Processing (http://processing.org)
Teachers                Teach science and math with spreadsheets [Niess et al. 2007]
Accountants             Tabulate and summarize financial data with spreadsheets
Actuaries               Calculate and assess risks using financial simulation tools like MATLAB
Architects              Model and design structures using FormZ and other 3D modelers
Children                Create animations and games with Alice [Dann et al. 2006] and Scratch
Middle school girls     Use Alice to tell stories [Kelleher and Pausch 2006; Kelleher et al. 2007]
Webmasters              Manage databases and websites using Access, FrontPage, HTML, JavaScript
Health care workers     Write specifications to generate medical report forms
Scientists/engineers    Use MATLAB and Prograph [Cox et al. 1989] to perform tests and simulations
E-mail users            Write e-mail rules to manage, sort and filter e-mail
Video game players      Author "mods" for first person shooters, online multiplayer games, and The Sims
Musicians               Create digital music with synthesizers and musical dataflow languages
VCR and TiVo users      Record television programs in advance by specifying parameters and schedules
Home owners             Write control schedules for heating and lighting systems with X10
Apple OS X users        Automate workflow using AppleScript and Automator
Calculator users        Process and graph mathematical data with calculator scripting languages
Managers                Author and generate database-backed reports with Crystal Reports



intended for fifty. Instead, the key distinction is that as the number of intended uses 
of the program increases, a programmer will have to increasingly consider software 
engineering concerns in order to satisfy increasingly complex and diverse constraints. 
Second, even if a programmer does not intend for a program to be used by others, 
circumstances may change: the program may have broader value, and the code which 
was originally untested, hacked together, and full of unexercised bugs may suddenly 
require more rigorous software engineering attention. 

While our definition of end-user programming departs from previously published
definitions, we do so both to bring clarity to the field and to discuss some of the
underlying dimensions of historical use. For example, a number of connotations of
the phrase have emerged in research, many using it to refer to "novice" program- 
ming or "nonprofessional" programming, or system design that involves the participa- 
tion of end users. Many have also used it to describe an individual's identity [Nardi 
1993]. We believe these connotations conflate a number of related, but nonequivalent
concepts. 

For example, consider its use as an identity. A wide variety of people may engage in 
end-user programming; Table I gives just a glimpse of the diversity of people's com- 
putational creations. While it is natural to use the phrase "end-user programmers" to 
describe these groups, it is not always accurate, because a person's intent in programming
can depend on their task. For example, an accountant may use spreadsheets at
home to keep track of a family loan, but use and share a spreadsheet at work to manage 
annual tax preparation activities with other accountants. It would be inaccurate to call 
this accountant an end-user programmer in all situations, when the spreadsheet at 
home is intended for personal, short-term use, but the one at work is heavily main- 
tained and repeatedly used by multiple people. Thus, when we use the phrase "end-user 
programmer" in this article, we are indicating the intent behind a programming task, 
not a fundamental aspect of the programmer's identity. 

It is also important to not conflate end-user programming with inexperience. Professional
developers with decades of experience can engage in end-user programming by
writing code for personal use, with no intent to share their program with others. They 
may complete their end-user programming task more quickly and with fewer bugs, but 
Fig. 1. Programming activities along dimensions of experience and intent. The diagram is not intended
to suggest the distribution of programmers (as there are many more without experience than with), but 
simply the underlying dimensions that characterize programming activity. The upward slant in end-user 
programming indicates that people with more experience tend to plan for other uses of their code. 

they will not approach the work with the same quality goals that they would for pro- 
duction code. To illustrate this distinction, consider Figure 1, which portrays program- 
ming experience and intent as two separate dimensions. Computer science students 
and professional programmers code with the intent of creating software for people to 
use (or grade), but vary in their experience. Similarly, end-user programming involves 
programming for personal use, but can include a wide range of programming expertise 
(of course, there are many more inexperienced programmers than experienced ones; 
we are not arguing that the distribution of experience is uniform). 

Similarly, it is important to not conflate end-user programming with the use of 
"simple" languages. Experienced developers may use general purpose languages like 
C++ to achieve end-user programming goals, and in fact, many scientists do use such 
languages to do exploratory scientific analyses, without the intent of sharing the code 
or polishing it for future use [Segal 2007]. Similarly, experienced developers may use 
simple markup languages such as HTML and CSS to design commercial web sites. 
In general, end-user programming can involve a wide range of languages, from macro 
recording, to domain-specific languages, to conventional, general-purpose languages. 
The key distinction in the choice of language is whether it helps a person achieve their 
personal goal (e.g., "choosing the right tool for the job"). 

Related to the choice of language, end-user programming should not be conflated
with the use of a particular interaction technique for constructing code. Java programs
can be written with a text editor or a visual editor [Ko and Myers 2006] and languages 
that are traditionally written with visual editors, such as dataflow languages like 
Yahoo Pipes (http://pipes.yahoo.com) and Prograph [Matwin and Petrzykowski 1985], 
can also be expressed in textual syntax. The use of visual code editors has more to do 
with the difficulty of learning and manipulating textual syntax than the intent of the 
programmer. 

There are a number of phrases related to "end-user programming" that are worth 
discussing relative to our definition. End-user development has been defined as "a set 
of methods, techniques, and tools that allow users of software systems, who are acting 
as nonprofessional software developers, at some point to create, modify, or extend a 
software artifact" [Lieberman et al. 2006]. This notion of end-user development also 
focuses on the use and adaptations of software over time, and focuses on elements of 
the software lifecycle beyond the stage of creating a new program. More specifically, 
mutual-development, co-development, and participatory design refer to activities in
which end users are involved in a system's design, but may or may not be involved in its
actual coding [Henderson and Kyng 1991; Costabile et al. 2009; Mackay 1990]. These 
accounts of organizational, social, and collaborative perspectives on end-user devel- 
opment offer several valuable perspectives on how people appropriate and customize 
software, most of which are beyond the scope of this survey.

Terms such as customization, configuring [Eagan and Stasko 2008], and tailoring 
[Trigg and Bødker 1994; Kahler 2001] include parameterization of existing programs,
but not direct modification of a program's source code. Visual programming refers 
to a set of interaction techniques and visual notations for expressing programs. The 
phrase often implies use by end-user programmers, but visual notations are not always 
targeted at a particular type of programming practice. Domain-specific languages are 
programming languages designed for writing programs for a particular kind of context 
or practice. End-user programming may or may not involve such languages, since 
what defines end-user programming is the intent, not the choice of languages or tools. 
Finally, scripts and scripting languages are often distinguished from programs and 
programming languages by the use of machine interpretation rather than compilation 
and their "high-level" use in coordinating the functions of multiple programs or services. 
The phrase end-user programming, because it is often conflated with inexperience, 
often connotes the use of scripting languages since these languages have the reputation 
of being easier to learn. 

2.3. End-User Software Engineering 

With definitions of programming and end-user programming, we now turn to the cen- 
tral topic of this article, end-user software engineering. As we discussed in the previous 
section, the intent behind programming is what distinguishes end-user programming 
from other activities. This is because programmers' intents determine to what extent 
they consider concerns such as reliability, reuse, and maintainability and the extent to 
which they engage in activities that reinforce these qualities, such as testing, verifica- 
tion, and debugging. Therefore, if one defines software engineering as systematic and 
disciplined activities that address software quality issues, 2 the key difference between 
professional software engineering and end-user software engineering is the amount
of attention given to software quality concerns.

In professional software engineering, the amount of attention is much greater: if a 
program is intended for use by millions of users, all with varying concerns and unique 
contexts of use, a programmer must consider quality regularly and rigorously in order 
to succeed. This is perhaps why definitions of software engineering often imply rigor. 
For example, IEEE Standard 610.12 defines software engineering as "the application 
of systematic, disciplined, quantifiable approaches to the development, operation, and 
maintenance of software." Systematicity, discipline, and quantification all require sig- 
nificant time and attention, so much so that professional software developers spend 
more time testing and maintaining code than developing it [Tassey 2002] and they of- 
ten structure their teams, communication, and tools around performing these activities 
[Ko et al. 2007]. 

In contrast, end-user software engineering still involves systematic and disciplined 
activities that address software quality issues, but these activities are secondary to 
the goal that the program is helping to achieve. Because of this difference in priorities 
and because of the opportunistic nature of end-user programming [Brandt et al. 2008],
people who are engaging in end-user programming rarely have the time or interest in 
systematic and disciplined software engineering activities. For example, Segal [2007] 



2 Because the meaning of the phrase software engineering is still under much debate, we use the definition 
from current IEEE standards. 
Table II. Qualitative Differences Between Professional and End-User Software Engineering

Software Engineering Activity   Professional SE   End-user SE
-----------------------------   ---------------   -------------
Requirements                    explicit          implicit
Specifications                  explicit          implicit
Reuse                           planned           unplanned
Testing and Verification        cautious          overconfident
Debugging                       systematic        opportunistic



reported on several teams of scientists engaging in end-user programming, finding 
that software itself is not valued, that the process of creating software was highly 
iterative and unpredictable, and that testing was not considered important relative
to other domain-specific risks [Segal 2007]. These differences are coarsely summarized
in Table II, showing that end-user software engineering can be characterized by its 
unplanned, implicit, opportunistic nature, due primarily to the priorities and intents 
of the programmer (but perhaps also to inexperience). 

Given these differences, the challenge of end-user software engineering research is to 
find ways to incorporate software engineering activities into users' existing workflow, 
without requiring people to substantially change the nature of their work or their pri- 
orities. For example, rather than expecting spreadsheet users to incorporate a testing 
phase into their programming efforts, tools can simplify the tracking of successful and 
failing inputs incrementally, providing feedback about software quality as the user ed- 
its the spreadsheet program. Approaches like these, and the ones reported throughout 
the rest of this article, allow users to stay focused on their primary goals (teaching chil- 
dren, recording a television, making scientific discoveries, etc.), while still achieving 
software quality. 
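
The incremental tracking described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the design of any particular tool surveyed here: the names SheetFeedback, CellTestLog, judge, invalidate, and summary, and the cell names and values, are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CellTestLog:
    """Hypothetical record of a user's judgments about one formula cell."""
    passed: list = field(default_factory=list)  # inputs whose result looked right
    failed: list = field(default_factory=list)  # inputs whose result looked wrong


class SheetFeedback:
    """Tracks judgments as the user works, with no separate testing phase."""

    def __init__(self, formula_cells):
        self.logs = {name: CellTestLog() for name in formula_cells}

    def judge(self, cell, inputs, looks_right):
        # Called whenever the user marks a displayed value as right or wrong.
        log = self.logs[cell]
        (log.passed if looks_right else log.failed).append(inputs)

    def invalidate(self, cell):
        # Editing a formula discards earlier judgments about that cell.
        self.logs[cell] = CellTestLog()

    def summary(self):
        checked = sum(1 for log in self.logs.values() if log.passed or log.failed)
        suspect = sorted(name for name, log in self.logs.items() if log.failed)
        return {"checked": checked, "total": len(self.logs), "suspect": suspect}


# Assumed example: a grading sheet with two formula cells.
feedback = SheetFeedback(["total", "average"])
feedback.judge("total", {"A1": 90, "A2": 80}, looks_right=True)
feedback.judge("average", {"A1": 90, "A2": 80}, looks_right=False)
print(feedback.summary())
```

The summary is recomputed after every judgment or edit, so feedback about quality arrives during normal spreadsheet editing rather than in a separate testing step.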

It is important to note that we do not discuss the issue of educating users about 
software engineering practices in this article. Many of the techniques discussed in our 
review may have the side effect of teaching users about the importance of testing, for 
example, but this is not the primary goal of these techniques. There is a case to be made 
that anyone creating software with some potential for costly failure ought to engage in 
more rigorous and disciplined software engineering activities. This viewpoint, and any
research associated with it, is outside the scope of this article.

3. END-USER SOFTWARE ENGINEERING RESEARCH 

End-user software engineering research is interdisciplinary, drawing from computer 
science, software engineering, human-computer interaction, education, psychology and 
other disciplines. Therefore, there are several dimensions along which we could discuss 
the literature in this area, including tools, language paradigm, research approach, 
and so on. However, because we aim to contrast EUSE with professional software 
engineering, we chose to organize the literature by software engineering activities 
commonly listed in software engineering textbooks (e.g., Ghezzi et al. [2002]). For each 
of these, however, we frame the discussion from the perspective of an end user. 

(1) Requirements. How the software should behave in the world. 

(2) Design and Specifications. How the software behaves internally to achieve the 
requirements. 

(3) Reuse. Using preexisting code to save time and avoid errors (including integration, 
extension, and other perfective maintenance). 

(4) Testing and Verification. Gaining confidence about correctness and identifying fail- 
ures. 

(5) Debugging. Repairing known failures by locating and correcting errors. 
In our discussion of each of these, we will review research in understanding and
supporting these activities, and characterize the historical emphasis on particular 
paradigms or end-user programming domains. In doing so, however, we do not imply 
that these activities take place in sequence; indeed, the waterfall model [Ghezzi et al. 
2002] is even less appropriate in end-user programming than it is in professional 
development. 

It is important to note that we do not explicitly discuss the implementation and 
use of end-user programs, even though these activities are a central part of end-user 
programming activity. Surveys of implementation issues in end-user programming 
have been discussed extensively in previous literature [Sutcliffe and Mehandjiev 2004; 
Kelleher and Pausch 2005; Lieberman et al. 2006; Wulf et al. 2006]. The use of end-user 
programs depends largely on the purpose for which they were created, and most end-user
software engineering research has attempted to be independent of purpose. We do, 
however, discuss the maintenance of end-user programs in our discussion of sharing in 
Section 3.3. 



3.1. What Should My Program Do? — Requirements 

The term "requirements" refers to statements of how a program should behave in the 
world (as opposed to the internal behavior of a program, which is how it achieves these 
external concerns). For example, a requirement for a tax program might be "Create a 
properly formatted 1040 tax form based on my financial data." This is a statement of a 
desired result, but not of how the result is achieved. 

In considering the history of work on this activity, the contributions have largely 
focused on understanding the sources and types of requirements of different domains 
of end-user programming and contrasting these with the role of requirements in pro- 
fessional software engineering. 

For example, in professional software engineering, projects usually involve a require- 
ments gathering phase that results in requirements specifications. These specifications 
can be helpful in anticipating project resource needs and for negotiating with clients. 
For end-user software engineering, however, the notion of requirements has to be rein- 
vented. Because their motivations are not related to the software, but to some other 
goal, people engaging in end-user programming rarely have an interest in explicitly 
stating their requirements. This means they may be less likely to learn formal lan- 
guages in which to express requirements or to follow structured development method- 
ologies. Furthermore, in many cases, end users may not know the requirements at the 
outset of a project; the requirements may only become clear in the process of implementation
[Costabile et al. 2006; Fischer and Giaccardi 2006; Mørch and Mehandjiev
2000; Segal 2007]. While this is also true in Agile development [Coplien and Harrison 
2004], Agile developers explicitly recognize that requirements will evolve and have 
tools and processes that plan for emergent requirements. In contrast, people engaging 
in end-user programming are unlikely to plan in this way. 

Another difference between requirements in professional and end-user software en- 
gineering is the source of the requirements. In professional settings, the customers and 
users are usually different people from the developers themselves. In these situations, 
requirements analysts use formal interviews and other methods [Beyer and Holtzblatt 
1998] to arrive at the requirements. End-user programmers, on the other hand, are 
usually programming for themselves or for a friend or colleague. Therefore, end-user 
software engineering is unlike other forms of software engineering, where the challenge 
of requirements definition is to understand the context, needs and priorities of other 
people and organizations. For end users, requirements are both more easily understood 
(because the requirements are their own) and more likely to change (because end users 
may need to negotiate such changes only with themselves). Furthermore, end users' 
requirements may be implicit, and perhaps not even consciously recognized. 

The situation in which an end user programs also affects the type of requirements. 
For example, at an office, the requirements are often to automate repetitive operations 
(such as transferring or transforming pieces of information such as customer names, 
products, accounts, or documents). In this context, a decision to write a program at all 
corresponds directly to real investment since time is money. End users who become 
successful at automating their own work often find that their programs are passed 
on to others, whether by simple sharing of tools between peers [MacLean et al. 1990], 
or as a means for managers to define office procedures. These social contexts start 
to resemble the concerns of professional software developers, for whom requirements 
analysis extends to the definition and negotiation of work practices [Ko et al. 2007]. 

At home, end-user software engineering is seldom about efficiency (except in the case 
of office-like work that is done at home, such as taxes). Instead, typical tasks include 
automation of future actions, such as starting a cooker or recording television. It is of- 
ten the case that one member of a household becomes expert in operating a particular 
appliance, and assumes responsibility for programming it [Rode et al. 2005b; Blackwell 
2004]. In this context, requirements are negotiated within the social relations of the 
household, in a manner that might have some resemblance to professional software 
experiences. Sometimes there are no requirements to start with; for example, there 
is a long tradition of "tinkering," in which hobbyists explore ways to reconfigure and 
personalize technology with no definite end in mind [Blackwell 2006]. Even though 
these hobbyists might have envisioned some scenario of use when they made the pur- 
chase [Okada 2005], those motivations may be abandoned later. Instead, requirements 
evolve through experimentation, seeing what one can do, and perhaps motivated by 
the possibility of exhibiting the final product to others as a demonstration of skill and 
technical mastery. 

In online contexts, end users must often consider the users of the web site or service 
they are creating [Rode et al. 2006], demonstrating that the distinction between the 
intent behind end-user programming and professional programming is a continuum 
rather than a mutually exclusive categorization. Further, in some situations, require- 
ments are shared and negotiated, as happens with professional software developers. 
For example, Scaffidi et al. [2006] interviewed six Hurricane Katrina site develop- 
ers and found that three relied on teammates for evaluating what features should be 
present and whether the site was viable at all. In this same study, requirements were 
derived from beliefs about the users of the program. One writer on the aggregators' 
email distribution list recognized that this "loosey goosey data entry strategy" would 
provide end users with maximal flexibility. Unfortunately, the lack of validation led to 
semantic errors that software propagated into the new database. 

In educational contexts, programming is often used as a tool to educate students 
about mathematics and science. What makes these classroom situations unique is 
how requirements are delivered to and adapted by students. For example, Rosson 
et al. [2002] describe a participatory design workshop in which pairs of students and 
senior citizens created simulation projects to promote discussion about community 
issues. In this situation, requirements emerged from interpersonal communication in 
conversation and then were later constrained by the capabilities of the simulation tool. 
This contrasts with a classroom study of AgentSheets [Ioannidou et al. 2003], in which 
small groups of elementary school students followed a carefully designed curriculum 
to design biological simulations. In this situation, the instructions set the scope of the 
programming and students chose the detailed requirements within this scope. In other 
contexts [Niess 2007], the teachers and the students are end-user programmers. The 
degree to which the teachers understood the abilities and limitations of spreadsheets 



ACM Computing Surveys, Vol. 43, No. 3, Article 21, Publication date: April 2011. 






affected not only the requirements they developed in lab activities, but also the degree 
to which the students understood the abilities and limitations of spreadsheets. 

In general, research has not attempted to explicitly support requirements capture, 
and the studies we have discussed should help reveal why. There are some techniques, 
however, that can be viewed as a form of requirements elicitation. For example, the 
Whyline [Ko and Myers 2004], which allows users to ask "why" questions about their 
program's output, is an implicit way of learning about what behavior the user intended 
and did not intend. The same is true of the goal debugging work in the spreadsheet 
paradigm [Abraham and Erwig 2007b], which allows users to inquire about incorrect 
values in spreadsheet output. Both of these systems are a form of requirements elici- 
tation, in which the requirements are used to support debugging. 

3.2. How Should My Program Work? — Design and Design Specifications 

In software engineering, design specifications specify the internal behavior of a system, 
whereas the requirements are external (in the world). In professional software engi- 
neering, software designers translate the ideas in the requirements into design specifi- 
cations. These specifications can be helpful in coordinating implementation strategies 
and ensuring the right prioritization of software qualities such as performance and re- 
liability. Design processes can ensure that all of the requirements have been accounted 
for. 

In end-user programming, the challenge of translating one's requirements into a 
working program can be daunting. For example, interview studies of people who wanted 
to develop web applications revealed that people are capable of envisioning simple 
interactive applications, but cannot imagine how to translate their requirements into 
working applications [Rosson et al. 2005]. Further, in end-user software engineering, 
the benefits of explicit design processes and specifications may be unclear to users. Most 
of the benefits of being explicit come in the long term and at a large scale, whereas 
end users may not expect long-term usage of their programs, even though this expectation 
is often inaccurate. For example, studies of spreadsheets have shown that end users 
are creating more and more complex spreadsheets [Shaw 2004], with typical corporate 
spreadsheets doubling in size and formula content every three years [Whittaker 1999]. 

In general, research on incorporating specifications into end-user programming has 
been quite pragmatic. If systems have supported any form of specifications, they have 
been used (1) to support a particular kind of design process, such as prototyping or 
exploratory activities, (2) as the primary programming language, or (3) as an intermediate 
language that either makes it easier to generate correct programs or helps 
with program validation. Most of these technologies have focused on improving the 
creation and validation of spreadsheets, the prototyping of web sites, and the expres- 
sion of preferences in the privacy domain. There is considerably less work on model- 
and specification-based approaches for interactive and web-based applications, though 
this is beginning to change. In the rest of this section, we review these approaches in 
light of the various imbalances and biases. 

3.2.1. Design Processes. Software design processes constrain how requirements are 
translated into design specifications and then implementations. They also involve a 
careful consideration of tradeoffs among conflicting goals such as reliability, maintain- 
ability, performance and other software qualities. These processes are usually learned 
by professionals through experience or training. Many end-user programmers, how- 
ever, are "silent designers" [Gorb and Dumas 1987], with no training in design and 
often seeing no benefit to ensuring such qualities. 

Some have proposed dealing with this lack of design experience by enforcing partic- 
ular design processes. For example, Ronen et al. [1989] propose a design process that 
Fig. 2. Links between web site content, sketched in DENIM, an informal web site sketching tool. (Reprinted 
from Newman et al. [2003] with permission from authors.) 

focuses on ensuring that spreadsheets are reliable, auditable, and safe to update (with- 
out introducing errors). Powell and Baker [2004] define strategies and best practices 
for spreadsheet design to improve the quality of created spreadsheets. Outside of the 
spreadsheet domain, Rosson et al. [2007] tested a design process with end-user web 
programmers based on scenarios and concept maps, finding that the process was use- 
ful for orienting participants towards particular design solutions. One problem with 
dictating proper design practices is that end-user programmers often design alone, 
making it difficult to enforce such processes. 

An alternative to enforcing good behavior is to let end users work in the way they 
are used to working, but inject good design decisions into their existing practices. One 
crucial difference between trained software engineers' and end users' approaches to 
problem solving is the extent to which they can anticipate design constraints on a solu- 
tion. Software engineers can use their experience and knowledge of design patterns to 
predict conflicts and dependencies in their design decisions [Lakshminarayanan et al. 
2006]. End-user programmers, however, often come to understand the constraints on 
their programs' implementations only in the process of writing their program [Fischer 
and Giaccardi 2006]. 

Because end-user programmers' designs tend to be emergent, like their require- 
ments, requirements and design in end-user programming are rarely separate activi- 
ties. This is reflected in most design approaches that have been targeted at end-user 
programmers, which largely aim to support evolutionary and exploratory prototyping, 
rather than up-front design. For example, DENIM, a sketching system for designing 
web sites, allows users to leave parts of the interface in a rough and ambiguous state 
[Newman et al. 2003]. This characteristic is called provisionality [Green et al. 2006], 
where elements of a design can be partially, and perhaps imprecisely stated. 

Another approach to dealing with end users' emergent designs is to constrain what 
can be designed to a particular domain. The WebSheets [Wolber et al. 2002] and Click 
Fig. 3. A ViTSL template, specifying the underlying structure of a grading spreadsheet. The names appear 
in rows and the assignments appear in columns, with the ellipses indicating repetition. (Reproduced from 
Abraham and Erwig [2006c] with permission from authors.) 

[Rode et al. 2005a] environments both strive to aid users in developing web applications 
at a level of abstraction that allows the environment to generate database-driven web 
applications, rather than write the necessary code at a lower level. There are several 
other commercial systems that impose similar constraints on design within a certain 
domain. Yahoo Pipes (http://pipes.yahoo.com), for example, allows the composition, 
selection and refinement of RSS feeds, but limits these activities to a predefined set of 
operators. Apple's Automator is a similar system, enabling the construction of dataflow 
programs that process and operate on data with multiple applications, but limiting 
these operations to a predefined set. 
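
The idea of constraining composition to a predefined operator set can be sketched in a few lines of Python. This is only an illustration of the general technique, not the design of Yahoo Pipes or Automator; the operator names (`filter_contains`, `sort_by`, `take`) and the feed item format are hypothetical.

```python
# Hypothetical sketch: a dataflow pipeline limited to a fixed operator set.

def filter_contains(field, text):
    """Keep items whose field contains the text (case-insensitive)."""
    return lambda items: [i for i in items if text.lower() in i[field].lower()]

def sort_by(field):
    """Order items by the given field."""
    return lambda items: sorted(items, key=lambda i: i[field])

def take(n):
    """Keep only the first n items."""
    return lambda items: items[:n]

# The only building blocks a user may compose.
OPERATORS = {"filter_contains": filter_contains, "sort_by": sort_by, "take": take}

def run_pipeline(steps, items):
    """Apply each (operator name, arguments) step in order."""
    for name, args in steps:
        if name not in OPERATORS:
            raise ValueError(f"unknown operator: {name}")
        items = OPERATORS[name](*args)(items)
    return items

feed = [
    {"title": "Spreadsheet errors", "date": "2008-03-01"},
    {"title": "Web mashups", "date": "2007-11-15"},
    {"title": "Debugging spreadsheet formulas", "date": "2006-05-20"},
]
steps = [
    ("filter_contains", ("title", "spreadsheet")),
    ("sort_by", ("date",)),
    ("take", (1,)),
]
result = run_pipeline(steps, feed)
```

Because every step is looked up in `OPERATORS`, the environment can reject anything outside the domain, which is the tradeoff described above: less expressiveness in exchange for programs the tool can always execute.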

Supporting emergent designs under changing ideas of requirements can also be done 
by supporting asynchronous or synchronous collaborations between professional soft- 
ware developers and end-user programmers. Approaches that emphasize synchronous 
aspects view professional developers and end-user programmers as a team (e.g., 
Costabile et al. [2006] and Fischer and Giaccardi [2006]). On the other hand, in strictly 
asynchronous approaches, the professional developer provides tailoring mechanisms 
for end-user programmers, thereby building in flexibility for end-user programmers to 
adjust the software over time as new requirements emerge [Bandini and Simone 2006; 
Dittrich et al. 2006; Letondal 2006; Stevens et al. 2006; Won et al. 2006; Wulf et al. 
2008]. As Pipek and Kahler [2006] point out, tailorability is a rich area, including not 
only issues of how to support low-level tailoring, but also numerous collaborative and 
social aspects. 

3.2.2. Writing Specifications. In professional software engineering, one way to ensure 
that requirements have been satisfied is to write explicit design specifications and 
then have tools check the program for inconsistencies with these specifications. In 
general, tools and languages for expressing specifications tend to be declarative in 
nature, allowing users to express what they want to happen, but not necessarily how. 

In applying this idea to end-user software engineering, one approach is for a tool to 
require up-front design. For example, ViTSL separates the modeling and data-entry 
aspects of spreadsheet development [Erwig et al. 2005]. The spreadsheet model is 
captured as a template [Abraham et al. 2005] like the one in Figure 3. The ellipsis 
under row 3 indicates that the row can be repeated downwards; each row stores the 
scores of a student enrolled in the course. These templates can then be imported 
into a system called Gencel [Erwig et al. 2005, 2006], which can be used to generate 
spreadsheets that are guaranteed to conform to the model represented by the template. 
For example, an instance of the template in Figure 3 is shown in Figure 4. The menu 
Fig. 4. An instance of the grade sheet template from Figure 3 loaded into Excel. The operations in the 
toolbar on the right utilize the spreadsheet's underlying structure to help users avoid introducing errors into 
the structure. (Reproduced from Abraham and Erwig [2006c] with permission from authors.) 

bar on the right allows the user to perform insertion and deletion, protecting the user 
against unintended changes. One limitation of this approach is that once a spreadsheet 
is generated from a template, edits to the generated spreadsheet cannot be propagated 
back to the template. (This same problem occurs in code generation systems in software 
engineering, where changes to the code are not reflected back to the specifications.) 
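
To make the idea of template-guaranteed structure concrete, the following Python sketch models a grade sheet whose only editing operations are those that a template would permit. It is an illustration under our own assumptions, not ViTSL or Gencel: the class `GradeSheet`, its methods, and the column layout are hypothetical.

```python
# Hypothetical sketch: edits go through structure-preserving operations only.

class GradeSheet:
    """Grade sheet whose rows always follow one template: name, scores, total."""

    def __init__(self, assignments):
        self.assignments = list(assignments)
        self.rows = {}  # student name -> {assignment: score}

    def insert_student(self, name):
        # Every new row gets one score cell per assignment column.
        self.rows[name] = {a: 0.0 for a in self.assignments}

    def insert_assignment(self, assignment):
        # Adding a column extends every existing row, so no row falls out of step.
        self.assignments.append(assignment)
        for scores in self.rows.values():
            scores[assignment] = 0.0

    def delete_student(self, name):
        del self.rows[name]

    def set_score(self, name, assignment, score):
        if assignment not in self.assignments:
            raise KeyError(f"no such assignment: {assignment}")
        self.rows[name][assignment] = float(score)

    def total(self, name):
        # The total is derived from the template's column list, so it
        # cannot refer to a stale or partial range of cells.
        return sum(self.rows[name][a] for a in self.assignments)


sheet = GradeSheet(["HW1", "HW2"])
sheet.insert_student("Ada")
sheet.set_score("Ada", "HW1", 9)
sheet.set_score("Ada", "HW2", 8)
sheet.insert_assignment("Exam")
sheet.set_score("Ada", "Exam", 40)
```

The sketch also shows the limitation noted above in miniature: the structure lives in the class, so a change made directly to `rows` would not be reflected back in the template.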

Some systems are intended to support the exploration of specifications by supporting 
modeling for a particular type of application. For example, Berti et al. [2006] describe 
CTTE, a system that helps users convert natural language descriptions of tasks and 
scenarios into a hierarchy of subtasks. This is essentially a modeling language that 
helps users to express the design and underlying workflow of a user interface. In a 
similar system, Lin and Landay [2008] describe Damask, a system that allows design- 
ers to prototype ubiquitous computing applications and test them with users. In both 
of these systems, the modeling languages were carefully designed with a particular 
domain and class of applications in mind. 

Most other systems that support specification writing are used for later verification 
and checking, rather than generating programs. For example, Topes [Scaffidi et al. 
2008] allows users to define string-based data types that can be used to check the 
validity of data and operations in any programming language that stores information as 
strings. Other researchers have developed end-user specification languages for privacy 
and security. For example, Dougherty et al. [2006] describe a framework for expressing 
access-control policies in terms of domain concepts. These specifications are stated 
as "differences" rather than as absolutes. For example, rather than stating who gets 
privileges in a declarative form, the system supports statements such as "after this 
change, students should not gain any new privileges." Cranor et al. [2006] describe 
Privacy Bird, a related approach, which includes a specification language for users 
to express their privacy preferences in terms of the personal information being made 
accessible. Privacy Bird then uses these specifications to warn users about web sites' 
violations of these preferences. 
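
A difference-style statement such as the one quoted above can be read as a check on the set of privileges that a policy change adds. The Python sketch below illustrates that reading only; it is not the framework of Dougherty et al., and the policy representation (a mapping from role to a set of privileges) and the function names are assumptions made for the example.

```python
# Hypothetical sketch: checking a policy change by what it adds, not by
# restating the whole policy.

def gained_privileges(before, after, role):
    """Privileges the role holds after the change but did not hold before."""
    return after.get(role, set()) - before.get(role, set())

def gains_nothing(before, after, role):
    """True if the change grants the role no new privileges."""
    return not gained_privileges(before, after, role)

before = {
    "student": {"read_grades"},
    "ta": {"read_grades", "edit_grades"},
}
# A change that only affects teaching assistants.
after_ok = {
    "student": {"read_grades"},
    "ta": {"read_grades", "edit_grades", "post_grades"},
}
# A change that accidentally lets students edit grades.
after_bad = {
    "student": {"read_grades", "edit_grades"},
    "ta": {"read_grades", "edit_grades"},
}
```

Here `gains_nothing(before, after_ok, "student")` holds, while `gained_privileges(before, after_bad, "student")` reports the unintended `edit_grades` privilege.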

Finally, some approaches for writing specifications take a direct manipulation, what 
you see is what you get (WYSIWYG) approach, moving the description of behavior and 
appearance to the user's language, rather than a machine language. For example, many 
WYSIWYG web site tools allow users to directly manipulate the design of a web site and 
then let the tool generate the HTML and CSS for use in a web browser (most notably, at 
the time of this writing, Adobe's Dreamweaver). To enable direct manipulation, such 
tools often limit the range of design possibilities to facilitate code generation, requiring 
users to learn the underlying language to express more complicated designs. 
3.2.3. Inferring Specifications. One approach to the problem of how to support a design 
process is to eliminate it, replacing it with technologies that can determine require- 
ments automatically through various forms of inference. 

Several systems have used a programming by example approach to this problem 
(such systems are described in detail in Lieberman [2000]). These systems allow users 
to provide multiple examples of the program's intended behavior and the tool observes 
the behavior and attempts to generalize from it. For example, Abraham and Erwig 
[2006a] developed an approach for automatically inferring the templates discussed 
in the previous section from an example spreadsheet, allowing users more flexibility 
in redefining the spreadsheet template as requirements change. In the domain of 
event-based simulations, the AgentSheets environment [Repenning and Perrone 2000] 
lets the programmer specify that a new type of "part" is just like an existing part, 
except for its icon; the tool will then generate all of the instructions necessary for the 
new part. McDaniel and Myers [1999] describe an approach to inferring interaction 
specifications, allowing users to click and drag objects from one part of the screen to 
another to demonstrate a desired movement at runtime. 

Recent work has begun to apply programming by example to web sites. For example, 
Toomim et al. [2009] allow users to select example data from web sites and automati- 
cally generate a range of user interface enhancements. Nichols and Lau [2008] describe 
a similar system, which allows users to create a mobile version of a web site through 
a combination of navigating through the desired portion of the site and explicitly se- 
lecting content. Macias and Paterno [2008] take a similar approach, in which users 
directly modify the web page source code. These modifications are used as a specifica- 
tion of preferences, which are then generalized and applied to other pages on the same 
site. Yet another approach allows users to identify the same content with multiple 
markings, increasing the robustness of the inference [Lingam and Elbaum 2007]. 

One problem with programming by example approaches is that the specifications 
inferred are difficult to reuse in future programs, since most systems do not package 
the resulting program as a reusable component or a function. There are some exceptions 
to this, however. For example, Scaffidi et al. [2007] describe an approach to inferring 
data type specifications from unlabeled textual examples and then allowing users to 
review and customize the specification. The specification can then be easily packaged 
and reused in other applications. Another counterexample is Smith et al. [2000]. 
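
As a rough illustration of inferring a reusable data type specification from unlabeled examples, the Python sketch below generalizes each example string to a character-class signature and adopts the most common signature as the inferred format. This is a deliberately simple stand-in chosen for the example, not the inference algorithm of Scaffidi et al.; the function names and the signature notation (`9` for a digit, `A` for a letter) are hypothetical.

```python
# Hypothetical sketch: infer a string format from examples, then reuse it.
from collections import Counter

def signature(text):
    """Generalize a string: digits become '9', letters 'A', the rest is kept."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def infer_format(examples):
    """Pick the signature shared by the most examples."""
    counts = Counter(signature(e) for e in examples)
    return counts.most_common(1)[0][0]

def matches(fmt, text):
    """Check a new string against a previously inferred format."""
    return signature(text) == fmt

examples = ["412-555-0134", "206-555-0188", "555-0101", "617-555-0172"]
fmt = infer_format(examples)
```

The inferred `fmt` is an ordinary string, so a user could inspect it, edit it, and pass it to `matches` in another program, which is the kind of packaging and reuse the paragraph above describes.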

An alternative to programming by example is to elicit aspects of the specification 
directly from end users. Burnett et al. [2003] describe an approach for spreadsheets, 
attaching assertions to each cell to specify intended numerical values. In this approach, 
seen in Figure 5, users can specify an intended range of a cell's value at any time. 
Then, the system propagates these ranges through cell formulas, allowing the system 
to further reason about the correctness of the spreadsheet. If a conflict is found between 
a user-generated assertion and a system-generated assertion, the system circles the 
two assertions to indicate the conflict. This assertions-based approach has been shown 
to increase people's effectiveness at testing and debugging [Wilson et al. 2003; Burnett 
et al. 2003]. Scaffidi describes a similar approach for validating textual input; we 
describe this approach in Section 3.3.4. 
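
The propagation of range assertions through formulas can be sketched with simple interval arithmetic. The Python example below is a minimal illustration, not the Forms/3 implementation: the `Range` class, the temperature formulas, and the rule that a conflict exists when the system-generated range is not contained in the user's range are all assumptions made for the sketch, and the arithmetic only handles formulas that combine one cell with numeric constants.

```python
# Hypothetical sketch: propagate a range assertion through a cell formula.
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    lo: float
    hi: float

    @staticmethod
    def _ordered(a, b):
        return Range(min(a, b), max(a, b))

    def __sub__(self, k):
        return Range(self.lo - k, self.hi - k)

    def __mul__(self, k):
        # A negative constant flips the bounds, so reorder them.
        return Range._ordered(self.lo * k, self.hi * k)

    def __truediv__(self, k):
        return Range._ordered(self.lo / k, self.hi / k)

    def within(self, other):
        return other.lo <= self.lo and self.hi <= other.hi

def propagate(formula, input_range):
    """System-generated assertion: the formula applied to the input range."""
    return formula(input_range)

def conflicts(user, system):
    """Assumed rule: conflict if the derived range escapes the user's range."""
    return not system.within(user)

fahrenheit = Range(32, 212)    # user assertion on the input cell
celsius_user = Range(0, 100)   # user assertion on the output cell

correct = propagate(lambda f: (f - 32) * 5 / 9, fahrenheit)
buggy = propagate(lambda f: f * 5 / 9, fahrenheit)  # forgot to subtract 32
```

With the correct formula the derived range agrees with the user's assertion; with the buggy formula the derived upper bound exceeds 100, which is the kind of disagreement the system would circle for the user to inspect.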

Other approaches take natural language descriptions of requirements and attempt 
to translate them into code. For example, Liu and Lieberman [2005] describe a system 
that takes descriptions of the intended behavior of a system and generates Python dec- 
larations of the objects and behaviors described in the descriptions. Little and Miller 
[2006] developed a similar approach for Chickenfoot [Bolin et al. 2005] (a web scripting 
language) and Microsoft Word's Visual Basic for Applications. Their approach exploits 
the user's familiarity with the vocabulary of the application domain to express com- 
mands in that domain. Users can state their goals in terms of the domain keywords 
Fig. 5. An assertion conflict in Forms/3. The user wrote the assertion on the Celsius cell (0 to 100), which 
conflicts with the computer generated assertion (0 to 500). This prompts the user to check for errors in the 
cells' formulas [Burnett et al. 2003]. Original figure obtained from authors. 



that they are familiar with and the system generates the code. Other systems have 
attempted to teach commands to users when their effect may not be visible, as in the 
case of scriptable groupware applications [Wulf 2000]. 

3.3. What Can I Use to Write My Program? — Reuse 

Reuse generally refers to either a form of composition, such as "gluing" together components, 
APIs, or libraries, or modification, such as changing some existing code to suit a 
new context or problem. In professional programming, most of the code that developers 
write involves reuse of some sort, whether copying code snippets, adapting example 
code, or using libraries and frameworks by invoking their APIs [Bellon et al. 2007]. 
Traditionally, the motivations for these various types of reuse are usually to save time, 
to avoid the risk of writing erroneous new code, and to support maintainability [Ye and 
Fischer 2005; Ravichandran and Rothenberger 2003]. 

While these practices are also true in end-user programming, in many ways they 
are made more difficult by end users' shift in priorities. In particular, finding, reusing, 
and even sharing code becomes more opportunistic, as the goals of reuse are more 
to save time and less to achieve other software qualities. Furthermore, in end-user 
programming, reuse is often what makes a project possible, since it may be easier for 
an end user to perform a task manually or not at all than to have to write it from 
scratch without other code to reuse [Blackwell 2002]. 

Prior work on reuse has largely focused on studies of reuse in more conventional 
programming languages with large APIs or object hierarchies. Consequently, many 
of the challenges that professionals face, end users face as well. Where these popu- 
lations differ is in how APIs, libraries, and frameworks must be designed to support 
a certain population. While APIs designed for professional use often focus on optimizing 
flexibility, end users often need much more focused support for achieving their 
domain-specific goals. In this section, we characterize prior work on these different 
reuse activities and compare and contrast the role of reuse in end-user programming 
and professional development. 

3.3.1. Finding Code to Reuse. A fundamental challenge to reuse is finding code and 
abstractions to reuse or knowing that they exist at all [Ye and Fischer 2005]. For 
example, Ko et al. [2004] found that students using Visual Basic.NET to implement 
user interfaces struggled when trying to use search tools to find relevant APIs, and 
instead relied on their more experienced peers for finding example code or APIs. This 
is similar to Nardi's [1993] finding that people often seek out slightly more experienced 
coworkers for programming help. Example code and example programs are among 
the greatest sources of help for discovering, understanding, and coordinating reusable 
abstractions, both in professional programming [Rosson and Carroll 1996; Stylos and 
Myers 2006] and end-user programming [Wiedenbeck 2005; Rosson et al. 2005]. In 
many cases, the examples are fully functional, so the programmer can try out the 
examples and better understand how they work [Rosson and Carroll 1996; Walpole 
and Burnett 1997]. 

Researchers have also invented a number of tools to help search for both example 
code and APIs. For example, the CodeBroker system watches the programmer type 
code and guesses what API functions the programmer might benefit from knowing 
about [Ye and Fischer 2005]. Other systems also attempt to predict which abstractions 
will benefit a professional programmer [Mandelin et al. 2005; Bellettini et al. 1999]. 
Mica [Stylos and Myers 2006] lets users search using domain-specific keywords. While 
all of these approaches are targeted at experienced programmers, many of the same 
ideas are beginning to be applied to languages intended for end-user programming. 
For example, CoScripter's support for sharing and finding others' scripts not only helps 
users search for examples, but utilizes other metadata such as a user's social network 
to help users find relevant programs [Bogart et al. 2008]. 

3.3.2. Reusing Code. Even if end users are able to find reusable abstractions, in some 
cases, they may have difficulty using abstractions that were packaged, documented, and 
provided by an API. One study of students using Visual Basic.NET for user interface 
prototyping showed that most difficulties relate to determining how to use abstractions 
correctly, coordinating the use of multiple abstractions, and understanding why ab- 
stractions produced certain output [Ko et al. 2004]. In fact, most of the errors that the 
students made had more to do with the programming environment and API, and not 
the programming language. For example, many students had difficulty knowing how 
to pass data from one window to another programmatically, and they encountered null 
pointer exceptions and other inappropriate program behavior. These errors were due 
primarily to choosing the wrong API construct or violating usage rules in the coordi- 
nation of multiple API constructs. Studies of end-user programming in other domains, 
such as web programming [Rode and Rosson 2003; Rosson et al. 2004], and numerical 
programming [Nkwocha and Elbaum 2005], have documented similar types of API 
usage errors. 

There are several ways of addressing mismatch between code and the desired func- 
tionality. In the case of code, one way is to simply modify the code itself, customizing 
it for a particular purpose. A special case of adapting such examples is the concept of 
a template. For example, Lin and Landay [2008], in their tool for prototyping user in- 
terfaces across multiple devices, provide a collection of design pattern examples [Beck 
2007] that users can adapt, parameterize, and combine. Some end-user development 
platforms, such as Adobe Flash, implement user interface components as modifiable 
templates. Templates have also been used to facilitate the creation of scripts for mobile 
devices to support the independence of people with cognitive disabilities [Carmien 
and Fischer 2008]. 

The mismatch between code and desired functionality can sometimes be addressed 
through tailoring by the end user. In this case, the user supplies parameters, rules, 
or even more complex information to the component or application, thereby chang- 
ing its behavior. Tailoring enables component developers to offload some adaptive 
maintenance activities onto end users [Dittrich et al. 2006], who essentially become 
asynchronous collaborators with the component developers [Mørch and Mehandjiev 
2000]. In order for a component to be tailorable, the component designer must engage 
in significant upfront planning; in particular, the designer must consider how users 
in the target population will differ from one another, then determine which aspects of 
the component should accordingly be tailorable [Dittrich et al. 2006]. There are several 
ways to uncover user needs, such as including users in the design and construction 
of the component [Letondal 2006], performing ethnographies [Stevens et al. 2006], or 
interviewing users about their likely requirements [Eagan and Stasko 2008]. Such approaches 
assume that the users involved in component design can represent the diverse 
needs of the target population. Since users often vary not only in their requirements, 
but also in their level of tailoring skill, some component designers provide multiple 
mechanisms for tailoring, so that users with more skill can take advantage of more 
complex mechanisms in order to effect more specialized tailoring [Eagan and Stasko 
2008; Mørch and Mehandjiev 2000; Wulf 1999; Wulf et al. 2008]. Tailorability can be 
greatly facilitated by integrated support for collaborative tailoring [Pipek and Kahler 
2006], integrity checks for detecting tailoring mistakes [Won et al. 2006; Wulf et al. 
2008], and evaluation features for helping users to understand the impact of tailoring 
[Won et al. 2006; Wulf et al. 2008]. 

Actually modifying the source code of APIs, libraries and other kinds of abstractions 
is generally not possible, since the code for the abstraction itself is not usually available. 
The programmer can sometimes write additional code to adapt an API [DeLine 1999], 
but there are certain characteristics of APIs and libraries, such as performance, that 
are difficult to adapt. Worse yet, when an end-user programmer is trying to decide 
whether some API or library would be suitable for a task, it is difficult to know in 
advance whether one will encounter such difficulties (this is also true for professionals 
[Garlan et al. 1995]). 

Another issue for API and abstraction use is whether future versions of the abstrac- 
tion will introduce mismatch because of changes to the implementation of the API 
behavior. For example, ActionScript [dehaan 2006] (the programming language for 
Adobe Flash) and spreadsheet engine upgrades often change the semantics of existing 
programs' behavior. In the world of professional programming, one popular approach 
to detecting such changes is to create regression tests [Onoma et al. 1988]. Another 
possibility is to proxy all interactions with the API and log the API's responses; then, 
if future versions of the API respond differently, the system can show an alert [Rakic 
and Medvidovic 2001]. Regression testing has been used in relation to spreadsheets 
[Fisher et al. 2002b]; beyond this, these approaches have not been pursued in end-user 
development environments. 
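
The proxy-and-log idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `RecordingProxy` class, the `find_behavior_changes` helper, and the two rounding functions standing in for successive API versions are all hypothetical and do not come from the cited work.

```python
import json


class RecordingProxy:
    """Wraps a callable API and logs each (arguments -> response) pair."""

    def __init__(self, api_function):
        self.api_function = api_function
        self.log = {}

    def __call__(self, *args):
        response = self.api_function(*args)
        self.log[json.dumps(args)] = response
        return response


def find_behavior_changes(recorded_log, new_api_function):
    """Replay logged calls against a new version; return calls that now differ."""
    changes = []
    for encoded_args, old_response in recorded_log.items():
        args = json.loads(encoded_args)
        new_response = new_api_function(*args)
        if new_response != old_response:
            changes.append((tuple(args), old_response, new_response))
    return changes


# Hypothetical API versions: version 2 silently changes how halves are rounded.
def round_half_v1(x):
    return int(x + 0.5)


def round_half_v2(x):
    return round(x)  # Python's round() sends halves to the nearest even integer


proxy = RecordingProxy(round_half_v1)
for value in (0.5, 1.5, 2.4):
    proxy(value)

changes = find_behavior_changes(proxy.log, round_half_v2)
```

Here `changes` contains one entry, for the argument 0.5, whose response moves from 1 to 0. A tool built on this idea could surface that entry as an alert after an upgrade.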



3.3.3. Creating and Sharing Reusable Code. Thus far, we have discussed reusing existing 
code, but most end-user programming environments also provide ways for users to 
create new abstractions. Table III lists several examples of reusable abstractions, dis- 
tinguishing between behavioral abstractions and data abstractions. Studies of certain 
classes of end users suggest that data abstractions are the most commonly created type 
[Rosson et al. 2005; Scaffidi et al. 2006]. 

Although end users have the option of creating such reusable abstractions, exam- 
ples are the more common form of reusable code. Unfortunately, it is extremely time- 
consuming to maintain a well-vetted repository of code. For example, Gulley [2006] 
describes the challenges in maintaining a repository of user-contributed Matlab ex- 
amples, with a rating system and other social networking features. For this reason, 
many organizations do not explicitly invest in creating such repositories. In such cases, 
programmers cannot rely on search tools but must instead share code informally 
[Segal 2007; Wiedenbeck 2005]. This spreads repository management across many 
individuals, who share the work of vetting and explaining code. 

Table III. Behavioral and Data Abstractions in Some End-User Programming Environments 

| Environment | Domain | Behavioral abstractions | Data abstractions |
|---|---|---|---|
| AutoHAN [Blackwell and Hague 2001] | Home automation | Channel Cubes can map to scripts that call functions on appliances. | Aggregate Cubes can represent a collection of other Media Cubes. |
| BOOMS [Balaban et al. 2002] | Music editing | Functions record series of music edits. | Structures contain notes and phrases. |
| Forms/3 [Burnett et al. 2001] | Spreadsheets | Forms simultaneously represent a function and an activation record. | Types are structured collections of cells and graphical objects. |
| Gamut [McDaniel and Myers 1999] | Game design | Behaviors are learned from positive and negative examples. | Decks of cards serve as graphical containers with properties. |
| Janus [Fischer and Girgensohn 1990] | Floor plan design | Critic rules encode algorithms for deciding if a floor plan is "good." | Instances of classes may possess attributes and sub-objects. |
| KidSim [Smith et al. 1994] | Simulation design | Graphical rewrite rules describe agent behavior. | Agents possess properties and are cloned for new instances. |
| Lapis [Miller and Myers 2002] | Text editing | Scripts automate a series of edits. | Text patterns can contain sub-structure. |
| Pursuit [Modugno and Myers 1994] | File management | Scripts automate a series of manipulations. | Filter sets contain files and folders. |
| QUICK [Douglas et al. 1990] | UI design | Actions may be associated with objects (that are then cloned). | Objects may have attributes and be cloned and/or aggregated. |
| TEXAO [Texier and Guittet 1999] | CAD | Formulas may drive values of attributes on cloneable objects. | Instances of classes may possess attributes and sub-objects. |

Although it is common for end-user programmers to view the code they create as 
"throw away," in many cases such code becomes quite long-lived [Mackay 1990]. Some- 
one might write a simple script to streamline some business process and then later, 
someone might reuse the script for some other purpose. This form of "accidental" shar- 
ing is one way that end-user programmers face the same kinds of maintainability con- 
cerns as professional programmers. In field studies of CoScripter [Leshed et al. 2008; 
Bogart et al. 2008], an end-user development tool for automating and sharing "how-to" 
knowledge, scripts in the CoScripter repository were regularly copied and duplicated 
as starting points for new scripts, even when the original author never intended such 
use [Bogart et al. 2008]. This unplanned sharing also means that improvements to 
the originally copied or shared code do not propagate to the copies that need it. This 
distinction between planned and unplanned reuse can demand different reuse tech- 
nologies. For example, many professional development tools that support copying, such 
as the "linked editing" technology described by Toomim et al. [2004], require users to 
plan their copying activities. 

3.3.4. Designing Reusable Code for End Users. One way to facilitate reuse by end users 
is to choose the right abstractions for their problem domains. This means choosing 
the right concepts and choosing the right level of abstraction for such concepts. For 
example, the designers of the Alice 3D programming system [Dann et al. 2006] 
consciously designed their APIs to provide abstractions that more closely matched people's 
expectations about cameras, perspectives, and object movement. The designers of the 
Visual Basic.NET APIs based their API designs on a thorough study of the common 
programming tasks of a variety of programmer populations [Green et al. 2006]. The 
Google Maps API is related in that the key to its success has been the relative ease 
with which users can annotate geographical images with custom data types. 

Fig. 6. The Toped++ pattern editor [Scaffidi et al. 2008], allowing the creation of string data types that 
support recognition of matching strings and transformation between formats. (Original figure obtained from 
authors.) 

In other cases, choosing the right abstractions for a problem domain involves under- 
standing the data used in the domain, rather than the behavior. For example, Topes 
[Scaffidi et al. 2008] is a framework for describing string data types unique to an organi- 
zation, such as room numbers, purchase order IDs, and phone number extensions (see 
Figure 6). By supporting the design of these custom data types, end-user programmers 
can more easily process and validate information, as well as transform information 
between different formats. This is a fundamental problem in many new domains of 
end-user programming, such as "mashup" design tools [Wong and Hong 2007] and RSS 
feed processors (e.g., http://pipes.yahoo.com). 
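
To make the idea of an organization-specific string data type concrete, the following is a minimal sketch, assuming a hypothetical phone-extension type with two written formats. It uses plain regular expressions with yes/no matching; it is not the Topes model or its API, and the names `FORMATS`, `recognize`, and `transform` are invented for illustration.

```python
import re

# Hypothetical organization-specific type: a phone extension written either
# as "x1234" or as "ext. 1234". Each format pairs a recognizer with a template.
FORMATS = {
    "short": (re.compile(r"^x(\d{4})$"), "x{digits}"),
    "long": (re.compile(r"^ext\. (\d{4})$"), "ext. {digits}"),
}


def recognize(text):
    """Return (format_name, digits) if text matches a known format, else None."""
    for name, (pattern, _template) in FORMATS.items():
        match = pattern.match(text)
        if match:
            return name, match.group(1)
    return None


def transform(text, target_format):
    """Rewrite a valid value in the target format; raise ValueError if invalid."""
    recognized = recognize(text)
    if recognized is None:
        raise ValueError(f"not a valid extension: {text!r}")
    _name, digits = recognized
    return FORMATS[target_format][1].format(digits=digits)
```

With these definitions, `transform("x1234", "long")` yields `"ext. 1234"`, while `recognize("12345")` returns `None`, which a validation tool could use to flag the value.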

Of course, as with any design, designing the right abstractions has tradeoffs. 
Specializing abstractions can result in a mismatch between the functionality of a 
reusable abstraction and the functionality needed by a programmer [Ye and Fischer 
2005; Wiedenbeck 2005]. For example, many functional mismatches occur because 
specialized abstractions often have nonlocal effects on the state of a program [Big- 
gerstaff and Richter 1989]. In addition to functional mismatch, nonfunctional issues 
can cause abstractions not to mesh well with the new program [Ravichandran and 
Rothenberger 2003; Shaw 1995]. End-user software engineering research is only 
beginning to consider this space of API and library design issues. 

3.4. Is My Program Working Correctly? — Verification and Testing 

There is a large range of ways to gain confidence about the correctness of a program, 
including through verification, testing, or a number of other approaches. The goals of 
testing and verification techniques are universal: they enable people to have a more 
objective and accurate level of confidence than they would if they were left unassisted. 
Where EUSE and professional SE differ is that end-user programmers' priorities often 
lead to overconfidence in the correctness of their programs. 

Research on testing and verification in end-user programming has primarily focused 
on helping end users manage their overconfidence, and primarily for the spreadsheet 
paradigm.³ More recent work has broadened support for testing and verification to 
the web, and researchers are also beginning to generalize the spreadsheet-focused 
technologies to other paradigms and domains. In this section, we discuss these various 
contributions and organize research by the different approaches to helping end-user 
programmers overcome overconfidence. 

3.4.1. Oracles and Overconfidence. A central issue for any type of verification is the 
decision about whether a particular program behavior or output is correct. The source 
of such knowledge is usually referred to as an oracle. Oracles might be people, making 
more or less formal decisions about the correctness of program behavior, or oracles can 
be explicitly documented definitions of the correct and intended behavior. 

People are typically imperfect oracles. Professional programmers are known to be 
overconfident [Leventhal et al. 1994; Teasley and Leventhal 1994; Lawrance et al. 
2005], but such overconfidence subsides as they gain experience [Ko et al. 2007]. Some 
end-user programmers, in comparison, are notoriously overconfident: many studies 
about spreadsheets report that despite the high error rates in spreadsheets, spread- 
sheet developers are heedlessly confident about correctness [Panko 1998, 2000; Hendry 
and Green 1994]. In one study, overconfidence about the correctness of spreadsheet cell 
values was associated with a high degree of overconfidence about the spreadsheets' 
overall correctness [Wilcox et al. 1997]. In fact, for spreadsheets, studies report that 
between 5% and 23% of the value judgments made by end-user programmers are incor- 
rect [Ruthruff et al. 2005a, 2005b; Phalgune et al. 2005]. In all of these studies, people 
were much more likely to judge an incorrect value to be right than a correct value to 
be wrong. 

These findings have implications for creators of error detection tools. The first is 
that immediate feedback about the values a program computes, without feedback about 
correctness, leads to significantly higher overconfidence [Rothermel et al. 2000; Krishna 
et al. 2001]. Second, because end users' negative judgments are more likely to be correct 
than positive judgments, a tool should "trust" negative judgments more. One possible 
strategy for doing so is to implement a "robustness" feature that guards against a 
large number of positive judgments swamping a small number of negative judgments, 
for example, as in Ruthruff et al. [2005b]. This approach was empirically tested in 
Phalgune et al. [2005] and was found to significantly improve the tool's feedback. 

3.4.2. Detecting Errors with Testing. One approach to helping end-user programmers de- 
tect errors is supporting testing. Testing is judging the correctness of programs based 
on the correctness of program outputs. Systematic testing — testing in accordance with 
a plan that defines exactly what tests are needed and when enough testing has been 
done — is crucial for success. Without it, the likelihood of missing important errors 
increases [Rothermel et al. 2001]. Furthermore, stronger (and more expensive) sys- 
tematic testing techniques have a demonstrated tendency to outperform weaker ones 
[Frankl and Weiss 1993; Hutchins et al. 1994]. Unfortunately, being systematic is 
often in conflict with end-user programmers' goals, because it requires time on 
activities that they usually perceive as irrelevant to success. Therefore, research on testing 
tools for end-user programmers has focused on testing approaches that are integrated 
with users' work and are incremental in their feedback. 

³ One possible explanation for this bias is that Microsoft Excel, the most widely used spreadsheet language, 
tends to produce output even in the presence of errors. This then leads to user overconfidence in program 
correctness, which researchers have tried to remedy through better testing tools. Furthermore, much of the 
original research on end-user software engineering was inspired by Nardi's investigation of spreadsheet use 
in business contexts [Nardi 1993]. 

Fig. 7. The WYSIWYT testing approach. Checkmarks represent decisions about correct values. Empty 
boxes indicate that a value has not been validated under the current inputs. Question marks indicate that 
validating the cell would increase testedness [Burnett et al. 2002]. (Original figure obtained from authors.) 

The most notable of these approaches is the "What You See Is What You Test" 
(WYSIWYT) methodology for doing "white box" testing of spreadsheets [Rothermel 
et al. 1998, 2001; Burnett et al. 2002]. With white box testing, the code is available 
to the tester [Beizer 1990]; in the case of spreadsheets, the formulas are the source 
code. Since testing most programs would require an infinite number of test cases in 
order to actually prove correctness, most white box approaches include a test adequacy 
criterion, which measures when "enough" testing has been done according to some 
code-based measure. Some criteria include branch coverage (test cases that exercise 
every branch), and statement coverage (exercising every statement in an imperative 
program) [White 1987]. With WYSIWYT, the criterion used is definition-use coverage, 
which (in the spreadsheet context) involves exercising every data dependency that 
could feasibly execute [Rothermel et al. 1998, 2001]. 

With WYSIWYT, as users develop a spreadsheet, they can also test that spread- 
sheet incrementally and systematically. At any point in the process of developing the 
spreadsheet, the user can validate any value that he or she believes is correct (the 
issues of oracles and overconfidence aside). Behind the scenes, these validations are 
used to measure the quality of testing in terms of a test adequacy criterion based on 
data dependencies. These measurements are then projected to the user using several 
different visual devices, to help them direct their testing activities. 

For example, suppose that a teacher is creating a student grades spreadsheet and has 
reached the point shown in Figure 7. During this process, whenever the teacher notices 
that a value in a cell is correct, she can check it off in the decision box in the upper 
right corner of that cell. A checkmark appears in the decision box, indicating that the 
cell's value has been validated under current inputs. The validated cell's border also 
becomes more blue, indicating that dependencies between the validated cell and cells it 
references have been "exercised" in producing the validated values. Red borders mean 
untested, blue borders mean tested, and any color in between means partially tested. 
From the border colors, the user is kept informed of which areas of the spreadsheet are 
tested and to what extent. The tool also supports more fine-grained access to testing 
the data dependencies in the spreadsheet, as well as a "percent tested" bar at the top 
of the spreadsheet, providing the user with an overview of her testing progress. 
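
The "percent tested" measure can be illustrated with a simplified sketch of definition-use coverage. The assumptions are significant: formulas are reduced to lists of referenced cells, there are no conditional formulas (so every dependency in a validated cell's backward slice counts as exercised), and the function names and the small grade sheet are hypothetical rather than taken from the WYSIWYT implementation.

```python
def du_pairs(references):
    """All (defining cell, using cell) data dependencies in the sheet."""
    return {(src, cell) for cell, sources in references.items() for src in sources}


def exercised_by(cell, references):
    """Dependencies that contributed to `cell`'s value (its backward slice)."""
    exercised, seen, stack = set(), set(), [cell]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for src in references.get(current, []):
            exercised.add((src, current))
            stack.append(src)
    return exercised


def percent_tested(validated_cells, references):
    """Share of data dependencies exercised by the user's validations."""
    all_pairs = du_pairs(references)
    covered = set()
    for cell in validated_cells:
        covered |= exercised_by(cell, references)
    return 100.0 * len(covered) / len(all_pairs) if all_pairs else 100.0


# Hypothetical grade sheet: Course depends on Quiz and Exam; Letter on Course.
references = {"Quiz": [], "Exam": [], "Course": ["Quiz", "Exam"], "Letter": ["Course"]}
```

In this sketch, checking off only `Course` exercises two of the three dependencies, while checking off `Letter` exercises all three, because its value depends transitively on every other cell.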

To help users think of values to test, users can invoke the "Help Me Test" utility to 
automatically generate suitable test values [Fisher et al. 2002a, 2006b]. This approach 
finds values that follow unexplored paths in the spreadsheet's dataflow and can also reuse 
prior test case values for regression testing after a spreadsheet has changed. Abraham 
and Erwig [2006b] describe an alternative approach to generating test values by back- 
propagating constraints on cell values, showing that it can be more effective in terms 
of efficiency and predictability. 

WYSIWYT is the most mature error-detecting approach for end-user program- 
mers. It includes support for reasoning about regions of cells with shared formulas 
[Fisher et al. 2006b; Burnett et al. 2002] and also interacts with assertions (covered 
in Section 3.2), fault localization, debugging (covered in Section 3.5), reuse of prior 
test cases [Fisher et al. 2002b], and the "Help Me Test" functionality mentioned ear- 
lier. There has also been research into how WYSIWYT can be applied to visual dataflow 
languages [Karam and Smedley 2002] and to the kind of "screen transition" program- 
ming being developed for web page design [Brown et al. 2003]. 

3.4.3. Checking Against Specifications. Another approach to detecting errors in programs 
is by checking values computed by the program against some form of specification; 
these specifications then serve as the oracle for correctness. For example, as discussed 
in Section 3.2, one form of specifications that can be entered is assertions about the 
values that a spreadsheet cell can have. Such assertions can be propagated to infer 
new assertions, using interval arithmetic [Ayalew and Mittermeir 2003; Burnett et al. 
2003]. Assertions that conflict with one another are also highlighted, showing errors 
in the assertions or the formulas through which they propagated. Other approaches 
validate string input against flexible data type definitions [Scaffidi et al. 2008]. In all 
of these approaches, values that do not conform to the assertions are highlighted. 
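
As a rough sketch of how assertions might be propagated with interval arithmetic, consider the example below. The `Interval` class, the weighted-average formula, and the deliberately wrong weights are assumptions made for illustration; the cited systems support many more operators and are not implemented this way as far as the text states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    low: float
    high: float

    def __add__(self, other):
        return Interval(self.low + other.low, self.high + other.high)

    def scale(self, factor):
        """Multiply by a constant; a negative factor swaps the bounds."""
        a, b = self.low * factor, self.high * factor
        return Interval(min(a, b), max(a, b))

    def contains(self, other):
        return self.low <= other.low and other.high <= self.high


# User-entered assertions on two input cells (hypothetical).
quiz = Interval(0, 100)
exam = Interval(0, 100)

# Hypothetical formula under test: Course = 0.4 * Quiz + 0.8 * Exam.
# The weights wrongly sum to 1.2, so the propagated range is too wide.
propagated = quiz.scale(0.4) + exam.scale(0.8)

# The user also asserted that Course lies between 0 and 100.
user_assertion = Interval(0, 100)
conflict = not user_assertion.contains(propagated)
```

The propagated assertion exceeds the user's upper bound of 100, so `conflict` is true. A tool could highlight the cell to signal that either the assertion or the formula is wrong.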

Elbaum et al. [2005] describe an approach for capturing user session data from users 
who utilize web applications, and using this data to distill relevant testing information. 
The approach can be abstractly thought of as identifying specification information 
about a web application in the form of an operational abstraction of usage of that 
application. By focusing on usage, the approach allows verification relative to an (often 
shifting) operational profile; this can detect errors not foreseen by developers of the 
application, who often have unrealistic expectations about application usage. 

3.4.4. Consistency Checking. Instead of using a human oracle or external specifications, 
some systems define correctness heuristics about the internal consistency of a 
program's code. One approach for spreadsheets is a form of type inference performed by "unit 
inference and checking systems" [Chambers and Erwig 2009; Abraham and Erwig 
2004, 2007b; Ahmad et al. 2003; Antoniu et al. 2004; Coblenz et al. 2005]. These 
approaches are based on the idea that users' layout of data, especially the labeled row 
and column headers, offers a form of user-defined type called a unit [Erwig and Burnett 
2002]. For example, the label (column head) "apples" would represent entries of type 
apple. The "apples" label gets propagated to other formulas that use this value, and 
the labels are combined in different ways depending on the operator. The program 
can then be checked against these units for consistency. To illustrate, consider the 
spreadsheet in Figure 8, using the UCheck system [Abraham and Erwig 2004, 2007a]. 
Because a column is labeled "Apples," the entries in that column can be considered of 
unit Apples. The approach begins with an analysis of spatial layout, also taking into 
account referencing relationships in formulas, to determine the relationships among 
header labels for rows and columns, their relationship to data entries, and how far 
in the spreadsheet these labels apply. Because the row labeled "Total" contains sums 
of the "Apples" entries, the system decides that "Total" marks the end of the "Apples" 
column. 

Fig. 8. The UCheck system for inferring units from headers. The arrows represent unit inferences based on 
the column and row labels [Abraham and Erwig 2007b]. 

The system can also reason about transformations that happen through formula 
operators/function calls, such as inferring that the sum of two Apples entries is also of 
type Apples, even if it is not in the Apples column. These inferences can be crosschecked 
for contradictions, and, just as in type inference, these contradictions are strong indi- 
cations of logic errors. For example, if the sum of two Apples entries occurs in the 
middle of the Oranges column, the system could consider this to be a case of conflicting 
type information and generate a unit error. Empirical studies suggest that end users 
are successful at using these features to detect errors [Abraham and Erwig 2007b]. 
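
The Apples/Oranges example can be sketched as a small consistency check. This is a deliberately narrow illustration, assuming each cell already carries the unit inferred from its column header and that the only formula is a sum. It does not reproduce UCheck's header analysis or its unit rules, and the `sheet` layout and `check_units` function are hypothetical.

```python
def check_units(sheet):
    """Flag sums whose operands share one unit that differs from the cell's own column."""
    errors = []
    for name, cell in sheet.items():
        operands = cell.get("sum_of")
        if not operands:
            continue  # plain data cell: nothing to check
        operand_units = {sheet[op]["column"] for op in operands}
        if len(operand_units) != 1:
            continue  # mixed operands need richer rules than this sketch has
        (unit,) = operand_units
        if unit != cell["column"]:
            errors.append((name, f"sum of {unit} placed in {cell['column']} column"))
    return errors


sheet = {
    "B2": {"column": "Apples"},
    "B3": {"column": "Apples"},
    "C2": {"column": "Oranges"},
    "C3": {"column": "Oranges", "sum_of": ["B2", "B3"]},  # suspicious placement
    "B4": {"column": "Apples", "sum_of": ["B2", "B3"]},   # consistent total
}

unit_errors = check_units(sheet)
```

Only `C3` is reported, since it sums two Apples entries but sits in the Oranges column; the total in `B4` agrees with its column and passes.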

Another form of internal consistency checking is statistical outlier finding, which 
involves identifying invalid data values that are mixed among a set of valid values. 
Miller and Myers [2001a] used this approach to help detect errors in text editing 
macros. Scaffidi [2007] developed a similar algorithm that infers a format from an un- 
labeled collection of examples that may contain invalid values. The generated format 
is presented in human-readable notation, so end-user programmers can review and 
customize the format before using it to find outliers that do not match the format. 
Raz et al. [2002] used anomaly detection to monitor on-line data feeds incorporated in 
web-based applications for possible erroneous inputs. All of these approaches use sta- 
tistical analysis and interactive techniques to direct end-user programmers' attention 
to potentially problematic values. 
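
A very simple version of format-based outlier finding is sketched below. It is an assumption-laden stand-in, not the algorithm of any cited system: each string is abstracted to a character-class signature, and values whose signature is rare (below a hypothetical `min_share` threshold) are reported for the user to review.

```python
from collections import Counter


def signature(text):
    """Abstract a string to its character classes, e.g. 'AB-12' -> 'AA-99'."""
    classes = []
    for ch in text:
        if ch.isdigit():
            classes.append("9")
        elif ch.isalpha():
            classes.append("A")
        else:
            classes.append(ch)
    return "".join(classes)


def find_outliers(values, min_share=0.2):
    """Return values whose signature covers less than `min_share` of the data."""
    counts = Counter(signature(value) for value in values)
    total = len(values)
    return [value for value in values if counts[signature(value)] / total < min_share]


phone_numbers = [
    "412-555-0101", "412-555-0199", "412-555-0143",
    "412-555-0177", "412-555-0120", "4125550166",
]
outliers = find_outliers(phone_numbers)
```

Five of the six values share the signature `999-999-9999`, so the unpunctuated value is the only one flagged. Showing the dominant signature to the user, who may then edit it, corresponds loosely to the review-and-customize step described above.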

3.4.5. Visualizations. Another way to check the correctness of a program is to visualize 
its behavior. Visualization tools enable end-user programmers to apply their knowledge 
of correctness to certain features of their program's behavior. For example, Igarashi 
et al. [1998] present comprehension devices that can aid spreadsheet users in dataflow 
visualization and editing tasks, and finding faults. More recent spreadsheet visual- 
ization approaches include detecting semantic regions and classes [Clermont 2003; 
Clermont and Mittermeir 2003; Fisher et al. 2006b]; ways to visualize trends and "big 
picture" relationships in spreadsheets that elide a number of low-level details 
[Ballinger et al. 2003] (see Figure 9); and visual auditing features in which similar 
groups of cells are recognized and shaded based on formula similarity [Sajaniemi 2000]. 
This latter technique builds on earlier work on the Arrow Tool, a dataflow visualization 
device proposed by Davis [1996]. 

Fig. 9. A "data dependency flow" of a spreadsheet's dependencies. (Reproduced from Ballinger et al. [2003] 
with permission from authors.) 

While most of these visualization tools have been created for the spreadsheet 
paradigm, there is a growing interest in allowing users to test and verify the behavior 
of machine learning algorithms. For example, Talbot et al. [2009] describe a system 
that allows users to partition data, view weights on different classifiers, and use a 
confusion matrix visualization to assess the behavior of the resulting classifier. Like 
the spreadsheet visualizations, this system focuses on portraying more global trends 
in the program behavior. 
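
For readers unfamiliar with confusion matrices, the data behind such a visualization can be computed as in the sketch below. The labels and predictions are made up, and the helper names are hypothetical; the cited system's own implementation is not described in the text.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each actual label received each predicted label."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Fraction of examples on the matrix diagonal (actual == predicted)."""
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / sum(matrix.values())


# Made-up labels for a two-class email classifier.
actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]

matrix = confusion_matrix(actual, predicted)
```

Off-diagonal entries such as `matrix[("spam", "ham")]` show which kinds of mistakes the classifier makes, which is the global view of behavior that the visualization is meant to convey.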



3.5. Why is My Program Not Working? — Debugging 

Whereas verification and testing detect the presence of errors, debugging is the pro- 
cess of finding and removing errors. Debugging continues to be one of the most time- 
consuming aspects of both professional and end-user programming [LaToza et al. 2006; 
Ko and Myers 2005, 2007]. Although the process of debugging can involve a variety 
of strategies, studies have shown across a range of populations that debugging is fun- 
damentally a hypothesis-driven diagnostic activity [Brooks 1977; Littman et al. 1986; 
Katz and Anderson 1988; Robillard et al. 2004; Gugerty and Olson 1986; Wiedenbeck 
and Engebretson 2004; Ko et al. 2004]. What makes debugging difficult in general 
is that programmers typically begin the process with a "why" question about their 
program's behavior, but must translate this question into a series of actions and queries 
using low-level tools such as breakpoints and print statements [Ko and Myers 2008]. 

Fig. 10. After the teacher marks a few successful and unsuccessful test values, the system helps her narrow 
down the most likely location of the erroneous formula [Ruthruff et al. 2005b]. (Original figure obtained from 
authors.) 

A number of issues make debugging even more problematic for end-user program- 
mers. Many lack accurate knowledge about how their programs execute and, as a result, 
they often have difficulty conceiving of possible explanations for a program's failure [Ko 
et al. 2004]. Furthermore, because end users often prioritize their external goals over 
software reliability, debugging strategies often involve "quick and dirty" solutions, such 
as modifying their code until it appears to work. In the process of remedying existing 
errors, such strategies often lead to additional errors [Ko and Myers 2003; Beckwith 
et al. 2005a]. 

Although prior studies of debugging have focused on a broad set of domains and 
language paradigms, the technologies to support debugging have generally focused 
on spreadsheets and event-based imperative languages. In this section, we organize 
approaches to supporting debugging in light of this historical bias and discuss their 
potential to generalize beyond these paradigms. 

3.5.1. Analyzing Dependencies. Dependencies in a program's execution can involve 
control dependencies (such as a statement only executing if a particular condition is true) 
and data dependencies (such as a variable's depending on the sum of two other vari- 
ables) [Tip 1995]. Such dependencies are the basis of a number of end-user debugging 
tools. 

One approach in the spreadsheet domain is an extension to the WYSIWYT testing 
framework, which was discussed in Section 3.4 [Ruthruff et al. 2005b]. To illustrate, 
see Figure 10 and recall the grades spreadsheet example in Figure 7. Suppose in the 
process of testing, the teacher notices that row 5's Letter grade ("A") is incorrect. The 
teacher indicates that row 5's letter grade is erroneous by "X'ing it out" instead of 
checking it off. Row 5's Course average is also wrong, so she X's that one, too. As 
Figure 10 shows, both cells now contain pink (gray in this article), but Course is darker 
than Letter because Course contributed to two incorrect values (its own and Letter's) 
whereas Letter contributed to only its own. These colors reflect the likelihood that the 
cell formulas contain faults, with darker shades reflecting greater likelihood. Although 
this example is too small for the shadings to contribute a great deal, in one study, users 
who used the technique on larger examples did tend to follow the darkest cells and 
were better at finding bugs than those without the tool [Ruthruff et al. 2005a]. 

To determine the colors from the X marks, three different algorithms have been used 
to calculate the WYSIWYT-based fault likelihood colorings [Ruthruff et al. 2006]. All 
three are based on variants of program slicing [Tip 1995]. The most effective algorithm 
is based on the sheer number of successful and failed test cases that have contributed 
to a cell's outcomes [Ruthruff et al. 2006]. Ayalew and Mittermeir [2003] devised a 
similar method of fault tracing in spreadsheets based on "interval testing" and slicing. 
This strategy reduces the search domain after it detects a failure, and selects a single 
cell as the "most influential faulty" cell. It has some similarities to the assertions work 
presented in Section 3.2.3 [Burnett et al. 2003], but it not only detects the presence of 
possible errors but also identifies which cells are most likely to contain faulty formulas. 
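
The test-count idea behind the fault likelihood colorings can be sketched as follows. The scoring formula, its weights, and the floor that keeps any cell implicated in a failure above zero are assumptions chosen for illustration; they are not the published algorithm. The sheet reuses the hypothetical grades example, with formulas reduced to lists of referenced cells.

```python
def backward_slice(cell, references):
    """The cell itself plus every cell its value depends on, transitively."""
    seen, stack = set(), [cell]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(references.get(current, []))
    return seen


def fault_likelihood(references, passed, failed):
    """Score formula cells by the failed and passed judgments they contributed to."""
    scores = {}
    for cell, sources in references.items():
        if not sources:
            continue  # input cells hold no formula, so they cannot be faulty
        fails = sum(cell in backward_slice(marked, references) for marked in failed)
        passes = sum(cell in backward_slice(marked, references) for marked in passed)
        # Hypothetical weighting: failures count double, with a floor of 1
        # so that passed tests never fully clear an implicated cell.
        scores[cell] = max(2 * fails - passes, 1 if fails else 0)
    return scores


references = {"Quiz": [], "Exam": [], "Course": ["Quiz", "Exam"], "Letter": ["Course"]}
scores = fault_likelihood(references, passed=[], failed=["Letter", "Course"])
```

As in the teacher example, `Course` scores higher than `Letter` because it contributed to both X'd values, whereas `Letter` contributed only to its own. A tool could map higher scores to darker shading.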

A new class of tools based on question asking rather than on test outcomes has re- 
cently emerged and has proven effective. The first tool to take this approach was Ko 
and Myers' Whyline [2004], which was prototyped for the Alice programming 
environment [Dann et al. 2006] and is shown in Figure 11. Users execute their program, and 
when they see a behavior they have a question about, they press a "Why" button. This 
brings up a menu of "why did" and "why didn't" questions, organized according to the 
structure of the visible 3D objects manipulated by the program. Once the user selects 
a question, the system analyzes the program's execution history and generates an an- 
swer in terms of the events that occurred during execution. In a user study, the Whyline 
reduced debugging time by a factor of 8 and helped users get through 40% more tasks, 
when compared to users without the Whyline [Ko and Myers 2004]. In a similar ap- 
proach, Myers et al. [2006] describe a word processor that supports questions about 
the document and the application state (such as preferences about auto-correction and 
styles). This system enabled the user to ask the system questions such as "why was 
teh replaced with the?" The answers were given in terms of the user-modifiable 
document and application state that ultimately influenced the undesirable behavior. Recent 
work has shown that these question-asking tools, particularly those that answer "why" 
questions, can help users understand software behavior more accurately and in more 
depth than systems that support "what if" and "how to" types of questions [Lim et al. 
2009]. 

3.5.2. Change Suggestions. An entirely different approach to debugging goes a step 
further in automation. GoalDebug is a semi-automatic debugger for spreadsheets 
[Abraham and Erwig 2007a] that allows the user to select an erroneous value, give 
an expected value, and get a list of changes to the spreadsheet's formulas that would 
result in the cell having the desired value. Users can interactively explore, apply, refine, 
or reject these change suggestions. The computation of change suggestions is based on 
a formal inference system that propagates expected values backwards across formulas. 
Empirical results so far show that the correct formula change was the first 
suggestion in 59% of the cases, and among the first five in 80% of the cases [Abraham and 
Erwig 2007b]. Of course, there will certainly be situations with such tools where the 
necessary change is far too complex for the system to infer. This approach also suf- 
fers from the oracle problem (Section 3.4.1), because it assumes that users can specify 
correct values. 
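
To give a feel for what a change suggestion is, the sketch below enumerates single-edit variants of a formula and keeps those that produce the user's expected value. Note the substitution: this is brute-force enumeration, not the backward propagation of expected values that GoalDebug is described as using, and the formula encoding, the operator set, and the function name are all hypothetical.

```python
import operator

OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def suggest_formula_changes(formula, values, expected):
    """Return single-edit variants of `formula` that evaluate to `expected`."""
    op, left, right = formula
    candidates = [(new_op, left, right) for new_op in OPERATORS]  # swap the operator
    for cell in values:                                           # swap one reference
        candidates.append((op, cell, right))
        candidates.append((op, left, cell))
    suggestions = []
    for candidate in candidates:
        if candidate == formula or candidate in suggestions:
            continue
        c_op, c_left, c_right = candidate
        if OPERATORS[c_op](values[c_left], values[c_right]) == expected:
            suggestions.append(candidate)
    return suggestions


# Hypothetical sheet: the total was written as A1 + A3, but the user expects 10.
values = {"A1": 4, "A2": 6, "A3": 10}
suggestions = suggest_formula_changes(("+", "A1", "A3"), values, expected=10)
```

The only suggestion is to change the formula to A1 + A2. The sketch also shows why the oracle problem matters: if the user's expected value is wrong, every suggestion is wrong with it.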

3.5.3. Sharing Reasoning. Given the variety of debugging tools that both detect and 
locate errors in spreadsheets, recent work has developed ways to combine the results 
of multiple techniques. For example, Lawrance et al. [2006] developed a system to com- 
bine the reasoning from UCheck [Abraham and Erwig 2004] and WYSIWYT [Ruthruff 
et al. 2005b]. The combined reasoning demonstrated both the importance of the infor- 
mation base used to locate faults and the mapping of this information into visual fault 
localization feedback for end-user programmers, replicating the findings of Ruthruff 
et al. [2005a]. They found that UCheck's static analysis of the spreadsheet effectively 
detected a narrow class of faults, while WYSIWYT (which was driven by a probabilistic 
model of users derived from previous work [Phalgune et al. 2005]) detected a broader 
range of faults with moderate effectiveness, and that certain combinations of the two 
were more effective than either alone. Additionally, by manipulating the mapping, they 
were able to improve the effectiveness of the feedback. 

3.5.4. Social and Cognitive Support. Aside from using tools to help end users debug, there 
are other approaches that take advantage of human and social factors of debugging. 
For example, a study of end-user debugging found that when end users worked in 
pairs rather than alone, they were more systematic and objective in their hypothesis 
testing [Chintakovid et al. 2006]. This approach was inspired by similar research on 
the benefits of pair programming for professional programmers. 

Kissinger et al. [2006] categorized people's comments during a debugging task in the 
lab, finding a number of questions that people ask of themselves, including "Am I smart 
enough?" "Is this the right value?" and "What should I do next?" These questions demon- 
strated the importance of supporting the individual's questions about planning and 
their meta-cognitive strategies, not just their questions about the debugging problem 
itself. These findings led to a video-based approach to teaching debugging strategies, 
in which a user could ask for help from videotaped human assistants [Subrahmaniyan 
et al. 2007; Grigoreanu et al. 2008]. In the study of this approach, participants chose 
better debugging strategies as a result of viewing the videos in the context of their 
problems, and had correspondingly more success at debugging. 

4. CROSS-CUTTING ISSUES IN END-USER SOFTWARE ENGINEERING 

As illustrated in the previous sections, there has been significant progress in under- 
standing and relieving the software engineering challenges that arise when people 
engage in end-user programming. However, in addition to this collection of studies 
and technologies, researchers have also explored several crosscutting issues that affect 
the degree to which people engage in software engineering activities or use software 
engineering tools. For example, in addition to intent, there are many other factors that 
distinguish end-user programming from professional software development. There are 
also many issues in how well tools can generalize across paradigms and user groups. 
Additionally, simply creating EUSE tools is not enough: users must see the value in 
using them and they must be able to use them effectively, despite individual differ- 
ences in strategy. In this section, we organize prior work on these issues and discuss 
the potential for future work. 

4.1. Risk, Reward, and the Role of Domain 

While our definitions of end-user programming and end-user software engineering 
draw a clear line at the intent behind programming, these definitions almost certainly 
do not capture the diversity of programming activities in the world. Some researchers 
have begun to document programming activities in different domains [Scaffidi et al. 
2006, 2007; Rosson and Kase 2006; Wiedenbeck 2005; Carver et al. 2007; Segal 2007; 
Myers et al. 2008; Petre and Blackwell 2007], finding that a wide range of people are 
engaging in programming to support their work or hobbies and not all of them are 
novice programmers. Some are skilled professionals, such as scientists and analysts, 
who also happen to have programming skills that they can apply to their work. It may 
not be meaningful to group these professionals together with novice programmers who 
have similar intents but different perceptions of learning new tools and a different 
willingness to redefine their work practices. 

Part of what underlies this distinction is that people vary in their tolerance and 
perceptions of risk and reward. For example, some teachers may not be willing to learn 
a new testing tool because they may not see the eventual payoff or may be skepti- 
cal about their own success. A financial analyst faced with performing thousands of 
manual transactions may see the situation differently. The costs involved in learning 
how to automate a task may be so high that it may seem more economical to find a 
manual alternative (or to persuade someone else to write the program). Blackwell's 
Attention Investment model [Blackwell and Green 1999; Blackwell 2002] provides a 
cognitive model of these insights, describing individuals' allocations of attention as 
cognitive "investments." According to the model, a user weighs four factors (not neces- 
sarily explicitly) before taking an action: (1) perceived benefits, (2) expected pay-off, (3) 
perceived cost, and (4) perceived risks. For example, imagine that an administrator in 
a small art museum is considering adopting one of the spreadsheet verification 
tools in Section 3.4.2 to detect errors, because of recent problems in inventory tracking. 
The administrator might see a benefit in that the enhancement would allow her to 
find and fix errors more quickly. The expected pay-off is that inventory tracking will 
be dependable, thus relieving her from the additional effort of supplementary audits. 
The perceived cost is that she will have to spend time learning to use the new features, 
while the perceived risk is that the features do not aid her enough to make it worth 
her effort. According to the Attention Investment model, her decision is based on a 
calculus of these factors. Some end-user programming techniques have attempted to 
support this decision making by measuring a trade-off point (e.g., how many items 
must be processed manually before the time saved is more than that needed to author 
the automation [Stylos et al. 2004; Miller and Myers 2001b]). 
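
The trade-off point mentioned above can be illustrated with a simple calculation. The following is a minimal sketch under a linear cost model that we assume for illustration; it is not the measure used by Stylos et al. or Miller and Myers, and the function and parameter names are hypothetical. It returns the smallest number of items for which the time saved by automation exceeds the time needed to author it.

```python
import math
from typing import Optional


def break_even_items(authoring_time: float,
                     manual_time_per_item: float,
                     automated_time_per_item: float = 0.0) -> Optional[int]:
    """Smallest item count n with n * saving_per_item > authoring_time.

    Assumes (for this sketch only) that every item costs the same fixed
    time, both manually and once automated. Returns None when automation
    saves nothing per item and therefore never pays off.
    """
    saving_per_item = manual_time_per_item - automated_time_per_item
    if saving_per_item <= 0:
        return None
    # floor(...) + 1 gives the first n for which the saving is strictly larger.
    return math.floor(authoring_time / saving_per_item) + 1
```

For instance, with 600 seconds needed to author the automation, 30 seconds per item by hand, and 5 seconds per item once automated, this model says the automation pays off from the 25th item onward. A real user's estimate would also have to account for the perceived risk that the automation does not work as intended.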

The irony of attention investment is that even this careful thought involves the 
investment of attentional effort. It might even be the case that truly rigorous analysis 
of requirements can be more costly than writing another program (a phenomenon 
that plagues the advocates of formal specification languages). Users who have a choice 
between programming or manual procedures are likely to avoid such careful analysis 
of requirements not because they are lazy or careless, but simply because it would be 
a poor investment of attention to do so much thinking in advance, rather than making 
iterative adjustments or simply reverting to manual procedures. A further risk in the 
attention investment equation is that the program may malfunction, failing to bring 
the anticipated benefits of automation, or perhaps even resulting in damage. The effort 
involved in testing or debugging to avoid this eventuality is yet another investment of 
attention. 

In addition to differences in perceptions of risk and reward, the domain of practice 
can also have a significant influence on how willing users are to engage in software 
engineering activities. For example, the domain complexity, or the types of concepts 
modeled by software, can vary in nature. Weather simulations, for instance, are likely 
more complex than a teacher's grading system and are likely to involve different types 
of computational patterns and different software architectures. A related factor is an 
end-user programmer's domain familiarity. This is the difference between a banker 
writing banking software and a professional programmer writing banking software. 
The banker would have to learn to program, whereas the professional would have to 
learn banking concepts. The domain for which a program is written, and its underlying 
characteristics, are a fundamental part of Sutcliffe's exploration of reuse [Sutcliffe 
2002], in which he describes the role of granularity (how "large" an abstraction is) and 
abstraction (how it is partitioned). Both of these factors can influence how easily code 
can be appropriated for a particular task. 

Finally, people in different domains of practice may also collaborate differently. Pro- 
fessional developers work in teams [Ko et al. 2007], which can change the constraints 
on programming decisions, but this is often not the case in end-user programming. 
Teachers may work alone [Wiedenbeck 2005]; designers may work with other 
developers [Myers et al. 2008]; web developers may work with users [Scaffidi et al. 2007]. 
The cultural values around software development itself can also vary, influencing tool 
adoption and motivations to invest in learning software engineering concepts [Segal 
2005]. Further, the end user's organizational context imposes constraints and values 
of its own [Mehandjiev et al. 2006]. 

4.2. Persuading People to Use EUSE Tools 

Given the discussion in the previous section, and the lower priority of software engi- 
neering concerns in end-user programming, getting users to use EUSE tools at all is 
a significant challenge. End users may be reluctant to use new, unknown features of a 
system, because they may perceive the features as risky or unhelpful [Blackwell 2002]. 
Furthermore, because many people who engage in end-user programming lack training 
in software engineering principles, they may not see immediate, or even long-term, 
value in using software engineering tools. 

One approach to this problem is to train end-user programmers about software 
engineering and computer science principles rather than (or in addition to) trying to 
design tools around end users' existing habits. Some of the tools discussed in this survey 
have the side effect of training users, but do not explicitly intend to train. For example, 
a central issue in many of the tools described in this survey is the tension between 
formality and accessibility. With more explicit requirements, tests, and verifications, 
come more precise analysis, but more difficulty in expression (a related concept is 
provisionality [Green et al. 2006], which is the ability in a tool or notation to express 
things tentatively or imprecisely). Although this article has demonstrated notations 
that are both accessible and precise, there are only a few such examples [Beckwith 
et al. 2005b; Gross and Do 1996] and no work has assessed to what extent end users 
learn more general principles by using these technologies. 

Fig. 12. The Surprise-Explain-Reward strategy in Forms/3. The changing colors surprise users, the tooltips 
explain the potential rewards, and the further changes in colors and the percent testedness bar at the top 
are the rewards [Wilson et al. 2003]. (Original figure obtained from authors.) 

Some work has attempted to explicitly train end users in software engineering and 
computer science principles. Umarji et al. [2008] explored one approach, focusing on 
teaching quality assurance, reuse, and documentation best practices, but the training's 
influence on software quality is not yet known. The authors do suggest, however, that 
training may help end users make more informed decisions about when and when 
not to engage in software engineering activities, relative to their goals and priorities. 
Perhaps professional software developers develop similar instincts through experience 
and these instincts could be distilled into advice that can be taught. 

An alternative to explicitly training users in principles is to find "teachable moments" 
to provide a concrete, contextual illustration of the benefits of software engineering 
ideas. Surprise-Explain-Reward [Wilson et al. 2003; Robertson et al. 2004; Ruthruff 
et al. 2004] is one such approach, aimed at changing end-user programmers' perceptions 
of risk and reward. The strategy consists of these three basic steps. 

(1) Surprise the user in order to raise curiosity about a feature. 

(2) Provide explanations to satisfy the user's curiosity and encourage trying out the 
feature. 

(3) Give a reward for trying the feature, encouraging future use of the feature. 

A simple example of Surprise-Explain-Reward is enticing users to try out the WYSI- 
WYT testing features [Ruthruff et al. 2004], described in Section 3.4.2. One of the 
best ways to surprise users and get their attention is to violate their assumptions. For 
example, the red border in cell Exam_Avg in Figure 12 (grey in this article) may be 
surprising if the coloring is unexpected. If the user hovers over the surprising red cell 
border, a tool tip pops up with an explanation that "0% of this cell has been tested," 
a passive form of feedback that allows, but does not require, the user to attend to it 
[Robertson et al. 2004]. The user may respond by examining the cell value, deciding 
that it is correct, and placing a checkmark (✓) in the decision box at the upper right 
corner of the cell. As described in Section 3.4.2, this decision results in an increase 
of the cell's testedness, changing its color, and more importantly, an increase in the 
progress bar (at the top of Figure 12). Some of these rewards are functional (e.g., car- 
rying out a successful test), and others are perceivable rewards that do not affect the 
outcome of the task (e.g., the progress bar that informs the user how close he or she 
is to completing the testing). Research has shown that such perceivable rewards can 
significantly improve users' understanding and performance [Ruthruff et al. 2004]. 

The same Surprise-Explain-Reward strategy was used in designing the assertions 
described in Section 3.2.2. An empirical study of the feature [Wilson et al. 2003] found 
that although users had no prior knowledge of assertions, they entered a high number 
of assertions and viewed many explanations about assertions. Use of assertions was 
rewarded by more correct spreadsheets, as well as users' perceptions that assertions 
helped them to be accurate. 

One danger with such an approach is that users may "game the system," using 
the system and its features in order to achieve goals, such as coloredness, other than 
the intended one, namely correctness. This has been observed in studies of computer- 
based learning environments [Baker 2007], where the primary goal is learning, but 
some students learn how to manipulate the system to avoid learning. In the case of 
WYSIWYT, this might mean checking off all of the cells as correct without actually 
assessing the correctness of the cells' values, just to attain 100% testedness. Users 
might do this because they do not understand the meaning of the system's feedback, or 
possibly because the system makes it difficult to avoid using a feature. Because of this 
possibility, the aphorism of "garbage-in garbage-out" comes into play. In lab studies, 
this behavior was not enough to outweigh the effectiveness of WYSIWYT in helping 
users find errors [Burnett et al. 2004]. 

4.3. Self-Efficacy, Gender, and Strategy in EUSE Tool Use 

Even if one accounts for perceptions of risk and reward and makes a significant effort to 
train users about the benefits of a software engineering mindset, there is some evidence 
that personal factors such as self-efficacy and problem solving strategies can signifi- 
cantly influence how effective EUSE tools are at ensuring software quality. Researchers 
have only recently begun to consider the role of these individual differences [Beckwith 
and Burnett 2004; Grigoreanu et al. 2006, 2008; Beckwith et al. 2007; Subrahmaniyan 
et al. 2008]. 

Some of these investigations have been done in the context of self-efficacy, a psy- 
chology construct that represents an individual's belief in their ability to accomplish a 
specific task [Bandura 1977] (not to be confused with self-confidence, which refers to 
one's more general sense of self-worth). Research has linked it closely with performance 
accomplishments, level of effort, and the persistence a person is willing to expend on 
a task [Bandura 1977]. Because software development is a challenging task, a person 
with low self-efficacy may be less likely to persist when a task becomes challenging or 
may calculate attention investment tradeoffs differently [Blackwell et al. 2009]. 

One study considered the self-efficacy of males and females in a spreadsheet de- 
bugging task and how it interacted with participants' use of the WYSIWYT test- 
ing/debugging features present in the environment [Beckwith et al. 2005a]. The result 
in this study and others that followed [Beckwith et al. 2006, 2007] was that self-efficacy 
was predictive of the females' ability to use the debugging features effectively, but it 
was not predictive for males. The females, who had significantly lower self-efficacy, also 
were less likely than males to engage with the features they had been unfamiliar with 
prior to the study (regardless of whether the feature had been taught in the tutorial). 
Females expressed that they were afraid it would take them too long to learn about one 
of these features, but they actually understood the features as well as the males did. 
Because the females chose to rely on features they were familiar with already, they 
used formula editing rather than the debugging features to debug and, as a result, 
inserted more formula errors than the males. 

Another study considered gender differences in "tinkering," a form of playful ex- 
perimentation encouraged in educational settings because of its documented learning 
benefits [Rowe 1978]. Research suggests that tinkering is more common among males 
[Jones et al. 2000; Martinson 2005; Van Den Heuvel-Panhuizen 1999], especially in 
computing [Rode 2008]. Findings such as these prompted an experiment investigat- 
ing the effects of tinkering and gender on end-user debugging [Beckwith et al. 2006]. 
The results showed that females' tinkering was positively related to success, whereas 
males' tinkering was negatively related to success. This was because females were 
more likely than males to pause between their actions, leaving more time 
for analysis and interpretation of the changes that resulted from their actions. Males 
also tinkered more overall. 

These gender difference results led to the design of a new variant of these features, 
which adds explicitly "tentative" versions of the WYSIWYT features, aimed primarily at 
benefiting low-confidence females [Beckwith et al. 2005b]. These changes also slightly 
raise the cost of tinkering, aimed at reducing males' tendency to tinker excessively. 
Follow-on monitoring of feature usage showed encouraging trends toward closing of 
the gender gap in feature usage [Beckwith et al. 2007], and a lab study combining that 
feature enhancement with strategy explanation support showed significant reduction 
in the gender gap in feature usage and tinkering by improving females' usage without 
negative impact to males [Grigoreanu et al. 2008]. 

In a separate line of research, Kelleher and Pausch [2006] investigated issues of 
motivation in the domain of animations and storytelling. The goal was not to identify 
gender differences in performance, but to identify design considerations that would 
motivate middle school girls to tell stories using interactive animations. To do this, girls 
were asked to create detailed storyboards of stories they wanted to tell and annotate 
them with textual descriptions. Analyses of the storyboards revealed a small number 
of animations necessary to support storytelling, including speech bubbles for talking 
and thinking, walking, changing body positions, and touching other objects. These 
features led most of the participants in a study to sneak in extra time during 
class breaks to work on their storytelling projects [Kelleher et al. 2007]. Kafai [1996] 
has also studied gender differences in programming, but for a different audience in a 
different domain: ten-year-old children programming video games. Her work reported 
significant gender differences in game character development and in the kind of game 
feedback that the children programmed [Kafai 1997b]. 

The implications of these findings for the design of end-user software engineering 
tools reach more broadly than just gender: they suggest that there are barriers to success 
at end-user software engineering activities for both males and females. The body of work 
also suggests that features to support end users can be designed in a way 
that helps to remove these barriers, regardless of whether the person encountering 
them is male or female. Future work should better understand not only these barriers, 
but also ways of detecting when such barriers are encountered. 

5. CONCLUSIONS 

Most programs today are written not by professional software developers, but by people 
with expertise in other domains working towards goals supported by computation. This 
article has offered definitions that distinguish this practice from professional software 
development and it has organized decades of research on incorporating requirements, 
design, testing, verification, and debugging concerns into users' existing work practices. 

What we have found in our review is an early and in-depth focus on testing and de- 
bugging in spreadsheets and imperative event-based languages. Recent work, however, 
is moving in several directions at once: (1) to more platforms and paradigms, including 
the web and mobile devices, (2) to a broader array of software engineering concerns, 
including specification and reuse, and (3) to a broader focus on application 
domains, rather than language paradigm alone. The web in particular is becoming a 
dominant platform on which to study and support end-user programming, with much 
of the work occurring in the past few years [Little et al. 2007; Macias and Paterno 
2008; Nichols and Lau 2008; Toomim et al. 2009]. This is probably due to the rapid 
increase in the use of the web (as opposed to offline desktop applications) to support 
computational work. 

In general, this recent surge in the diversity of end-user software engineering re- 
search is both a blessing and a curse. The sheer diversity of domains that researchers 
are studying may lead researchers to find that the truly difficult problems in end-user 
software engineering arise from the domain itself, and not from the software engi- 
neering challenges. If this is the case, research in end-user software engineering will 
likely shift toward better understanding particular domains of practice, rather than 
particular paradigms or technologies. However, this diversity is also an opportunity: if 
among these widely diverse domains of practice there are fundamental software engi- 
neering challenges, the field of software engineering research has the opportunity to 
dramatically improve the broader use of computational tools in human endeavors. 

Part of this challenge is to consider the generalizability of software engineering tools. 
For example, do these groups need different fundamental software engineering sup- 
port, or do they just need software engineering support tailored to their domain of 
practice? For example, the Whyline [Ko and Myers 2004], which began as a debugging 
tool for end-user programming of animations in the Alice environment, was success- 
fully adapted for professional Java programmers [Ko and Myers 2008]. The differences 
between these two approaches are not in the larger concept, but in how the concept was 
tailored to the differing information needs of the two target user populations. Other 
tools, such as the testing and verification approaches described in Section 3.4, trans- 
form "batch" testing techniques to incremental approaches. Others still are traditional 
concepts that exploit properties of a particular language or the types of programs cre- 
ated with a language (for example, exploiting the layout of spreadsheets in a grid). 
Therefore, it is possible that the primary challenges in tool design are not fundamental 
conceptual differences in software engineering, but the adaptation of these concepts 
to particular domains of practice and differing priorities. This makes the previously 
mentioned work on motivation even more important, in that the adaptations may 
be primarily in reframing the presentation and interaction of more general software 
engineering tool concepts. 

Another type of tool generalizability is the extent to which software engineering 
concepts can transfer between paradigms. For example, the notion of interrogating 
program output [Ko and Myers 2004] was adapted to the spreadsheet paradigm [Abra- 
ham and Erwig 2007a] successfully, but the two prototypes have a number of differ- 
ences. For example, the Whyline focused on finding the code that caused or prevented 
a particular output. In contrast, in spreadsheets, finding the formula that caused a 
problem is generally trivial. Instead, Abraham and Erwig transformed the concept 
from one of asking questions to one of stating an expected value. The implementations 
of these systems were nontrivial and have little in common. Therefore, it is possible 
that while the general concepts involved in bringing software engineering concerns to 
end-user programming may generalize, very little else may transfer. Future work will 
reveal to what extent ideas that support other aspects of software engineering, such as 
specification and reuse, can generalize between paradigms and domains. 

Finally, in all of this research, it is important to remember that the programs that 
end-user programmers create are just small parts of the much larger contexts of their 
lives at work and at home. Understanding how programming fits into end users' 
everyday lives is central not only to the design of EUSE tools, but also to our understanding 
of why people program at all. The research on motivations and perceptions presented 
in Section 4 is just a glimpse of the contextual factors that influence programming 
activity. We expect that future work will discover and explain an even greater number 
of these factors and better inform the design of end-user software engineering tools. 